REPORT  DOCUMENTATION  PAGE 


Form  Approved 
OMB No. 0704-0188


The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources,
gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection
of information, including suggestions for reducing the burden, to Department of Defense, Washington Headquarters Services, Directorate for Information Operations and Reports
(0704-0188), 1215 Jefferson Davis Highway, Suite 1204, Arlington, VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be
subject to any penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number.

PLEASE DO NOT RETURN YOUR FORM TO THE ABOVE ADDRESS.


1. REPORT DATE (DD-MM-YYYY)

30-06-01

2. REPORT TYPE

Final Technical

3. DATES COVERED (From - To)


4.  TITLE  AND  SUBTITLE 

(FY97 AASERT) REPRESENTING AND SOLVING AIR CAMPAIGN
PROBLEMS AS PARTIALLY OBSERVABLE MARKOV
DECISION PROBLEMS


5b.  GRANT  NUMBER 

F49620-97-1-0477 


5c.  PROGRAM  ELEMENT  NUMBER 


6. AUTHOR(S)

Thomas L. Dean


5d.  PROJECT  NUMBER 


5e.  TASK  NUMBER 


5f.  WORK  UNIT  NUMBER 


7.  PERFORMING  ORGANIZATION  NAME(S)  AND  ADDRESS(ES) 

Brown  University 

Providence,  RI  02912 


8.  PERFORMING  ORGANIZATION 
REPORT  NUMBER 


9.  SPONSORING/MONITORING  AGENCY  NAME(S)  AND  ADDRESS(ES) 

AFOSR/NM 

801 North Randolph St, Rm. 732
Arlington,  VA  22203-1977 


10.  SPONSOR/MONITOR'S  ACRONYM(S) 

AFOSR 


11.  SPONSOR/MONITOR’S  REPORT 
NUMBER(S) 


12. DISTRIBUTION/AVAILABILITY STATEMENT

DISTRIBUTION STATEMENT A: Approved for Public Release

Publicly  available. 


13.  SUPPLEMENTARY  NOTES 


14.  ABSTRACT 

The original purpose of this project was to design algorithms and architectures for
maintenance and deployment scheduling solutions to support large-scale strategic military
airlift activities related to the needs of the Air Mobility Command (AMC), and
secondarily to adapt these solutions to other military and civilian planning and scheduling
problems. The research adopted a stochastic modeling framework and used novel
techniques for planning in unpredictable dynamic environments with complex state and
action spaces. Many of the original goals of the project were achieved in conjunction
with the primary contract; however, several extensions were carried out by students
receiving funding from the AASERT supplementary grant.




15.  SUBJECT  TERMS 

Statistical  models,  planning,  scheduling 


16.  SECURITY  CLASSIFICATION  OF: 


b.  ABSTRACT 

c.  THIS  PAGE 

UNCLASSIFIED

17.  LIMITATION  OF 
ABSTRACT 


18. NUMBER OF PAGES

19a. NAME OF RESPONSIBLE PERSON

Thomas L. Dean

19b.  TELEPHONE  NUMBER  (Include  area  code) 

401  863  7600 


Standard  Form  298  (Rev.  8/98) 

Prescribed  by  ANSI  Std.  Z39.18 


Final  Report 




The original purpose of this project was to design algorithms and architectures for
maintenance and deployment scheduling solutions to support large-scale strategic military
airlift activities related to the needs of the Air Mobility Command (AMC), and secondarily
to adapt these solutions to other military and civilian planning and scheduling problems.
The research adopted a stochastic-modeling framework and made use of novel techniques
for planning in unpredictable, dynamic environments with complex state and action spaces.

Many  of  the  goals  associated  with  the  project  were  achieved  in  conjunction  with  the 
primary  contract;  however,  several  extensions  were  carried  out  by  students  receiving  funding 
from  the  AASERT  supplementary  grant.  Originally,  Sonia  Leach,  a  graduate  student  at 
Brown,  was  to  be  the  primary  beneficiary  of  the  grant  and,  indeed,  Sonia  was  funded  by 
this  grant  for  one  year.  Later  Sonia’s  interest  turned  to  the  use  of  mathematically  related 
models  and  algorithms  that  were  targeted  at  problems  in  genomics  and  computational 
biology. 

Sonia Leach worked with researchers at the National Cancer Institute in Bethesda,
Maryland, applying machine learning techniques to analyze biological data. She began
collaborations with researchers at the University of Colorado Health Sciences Center in Denver,
Colorado, to analyze gene expression data. She received NIH funding for this work and so,
after consultation with AFOSR, the remainder of the AASERT grant was applied to other
students working on projects concerned with stochastic modeling methods.

The stochastic models that were at the heart of the scheduling and planning problems
have broad application. Sonia's work in genomics is a good example, but there are also
related models in statistical natural language processing that have recently received a lot
of attention. This grant paid supplemental stipends for Niyu Ge, Keith Hall, and Don Blaheta
for their work on solutions to language learning problems. Niyu Ge completed her PhD
dissertation on pronoun anaphora (finding the referents, or "antecedents", of pronouns) and
has now taken a position at IBM Research. Keith and Don are working on their dissertations
and should finish in the coming year, with the rest of their funding coming from NSF.

Luis Ortiz also received funding from this grant, and his research is directly related to
the combinatorial problems that were the primary focus of this AASERT proposal. Luis
developed new sampling methods for solving influence (decision) diagrams, an alternative
representation for stochastic planning and scheduling problems. He provided bounds on
the number of samples required to select "good" actions with high probability for what
was considered the "traditional sampling method". He proposed a new method that
requires fewer samples (both in expectation and with high probability) to obtain the same
results. Luis will complete his dissertation and defend in September and has accepted a
postdoctoral position at AT&T Labs.

Model Reduction Techniques for Computing Approximately
Optimal Solutions for Markov Decision Processes


Thomas  Dean  and  Robert  Givan  and  Sonia  Leach 
Department  of  Computer  Science,  Brown  University 
[tld,  rig,  sml]@cs.brown.edu 
http://www.cs.brown.edu/people/


Abstract 

We present a method for solving implicit (factored) Markov decision processes (MDPs)
with very large state spaces. We introduce a property of state space partitions
which we call ε-homogeneity. Intuitively, an ε-homogeneous partition groups together
states that behave approximately the same under all or some subset of policies.
Borrowing from recent work on model minimization in computer-aided software
verification, we present an algorithm that takes a factored representation of an MDP
and an ε with 0 ≤ ε ≤ 1 and computes a factored ε-homogeneous partition of the
state space.

This partition defines a family of related MDPs: those MDPs with state space equal
to the blocks of the partition, and transition probabilities “approximately” like those of
any (original MDP) state in the source block. To formally study such families of MDPs,
we introduce the new notion of a “bounded parameter MDP” (BMDP), which is a family
of (traditional) MDPs defined by specifying upper and lower bounds on the transition
probabilities and rewards. We describe algorithms that operate on BMDPs to find
policies that are approximately optimal with respect to the original MDP.

In combination, our method for reducing a large implicit MDP to a possibly much
smaller BMDP using an ε-homogeneous partition, and our methods for selecting actions
in BMDPs, constitute a new approach for analyzing large implicit MDPs. Among its
advantages, this new approach provides insight into existing algorithms for solving implicit
MDPs, provides useful connections to work in automata theory and model minimization,
and suggests methods, which involve varying ε, to trade time and space (specifically in
terms of the size of the corresponding state space) for solution quality.


1  Introduction 

Markov decision processes (MDPs) provide a formal basis for representing planning
problems involving uncertainty [Boutilier et al., 1995a]. There exist algorithms
for solving MDPs that are polynomial in the size of the state space [Puterman, 1994].
In this paper, we are interested in MDPs in which the states are specified implicitly
using a set of state variables. These MDPs have explicit state spaces which are exponential
in the number of state variables, and are typically not amenable to direct solution using
traditional methods due to the size of the explicit state space.

It is possible to represent some MDPs using space polylog in the size of the state space
by factoring the state-transition distribution and the reward function into sets of smaller
functions. Unfortunately, this efficiency in representation need not translate into an
efficient means of computing solutions. In some cases, however, dependency information
implicit in the factored representation can be used to speed computation of an optimal
policy [Boutilier and Dearden, 1994, Boutilier et al., 1995b, Lin and Dean, 1995].

The resulting computational savings can be explained in terms of finding a homogeneous
partition of the state space, that is, a partition such that states in the same block
transition with the same probability to each of the other blocks. Such a partition induces
a smaller, explicit MDP whose states are the blocks of the partition; the smaller MDP,
or reduced model, is equivalent to the original MDP in a well defined sense. It is possible
to take an MDP in factored form and find its smallest reduced model using a number of
“partition splitting” operations polynomial in the size of the resulting model; however,
these splitting operations are in general propositional logic operations which are NP-hard
and are thus only heuristically effective. The states of the reduced process correspond to
groups of states (in the original process) that behave the same under all policies. The
original and reduced processes are equivalent in the sense that they yield the same solutions,
i.e., the same optimal policies and state values.
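
As a small illustration of how a homogeneous partition induces a reduced model, the sketch below collapses an explicit MDP onto the blocks of a partition. It is only a sketch under stated assumptions, not the factored algorithm discussed in this paper: the dictionary layout `F[a][p][q]`, the function name `reduce_mdp`, and the toy numbers are hypothetical, and the partition passed in is assumed to be homogeneous already, with all states of a block sharing one reward.

```python
def reduce_mdp(F, R, partition):
    """Collapse an explicit MDP onto the blocks of a homogeneous partition.

    F[a][p][q] is the probability of moving from state p to state q under
    action a (missing entries mean probability 0), R[p] is the reward of
    state p, and partition is a list of sets of states.
    """
    reduced_F = {}
    for a in F:
        reduced_F[a] = []
        for block in partition:
            rep = next(iter(block))  # any member works if the partition is homogeneous
            reduced_F[a].append(
                [sum(F[a][rep].get(q, 0.0) for q in target) for target in partition]
            )
    reduced_R = [R[next(iter(block))] for block in partition]
    return reduced_F, reduced_R


# Hypothetical four-state example: states 0 and 1 differ in where they land
# inside block {2, 3}, but both reach that block with probability 1.
F = {"go": {0: {2: 0.5, 3: 0.5}, 1: {2: 0.25, 3: 0.75}, 2: {2: 1.0}, 3: {3: 1.0}}}
R = {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0}
partition = [{0, 1}, {2, 3}]
reduced_F, reduced_R = reduce_mdp(F, R, partition)
```

In this toy case the four-state process collapses to two aggregate states, and the block-to-block probabilities do not depend on which representative state is chosen.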

The basic idea of computing equivalent reduced processes has its origins in automata
theory [Hartmanis and Stearns, 1966] and stochastic processes [Kemeny and Snell, 1960]
and has surfaced more recently in the work on model checking in computer-aided
verification [Burch et al., 1994, Lee and Yannakakis, 1992]. Building on the work of Lee
and Yannakakis [1992], we have shown [Dean and Givan, 1997] that several existing
algorithms are asymptotically equivalent to first constructing the minimal reduced MDP
and then solving this MDP using traditional methods that operate on the flat (unfactored)
representations.

The minimal model may be exponentially larger than the original compact MDP. In response
to this problem, this paper introduces the concept of an ε-homogeneous partition of the
state space. This relaxation of the concept of homogeneous partition allows states within
the same block to transition with different probabilities to other blocks so long as the
different probabilities are within ε. For ε > 0, there are generally ε-homogeneous
partitions which are smaller and often much smaller than the smallest homogeneous
partition. In this paper we discuss approximate model reduction, an algorithm for finding
an ε-homogeneous partition of a factored MDP which is generally smaller and always no
larger than the smallest homogeneous partition.

Any ε-homogeneous partition induces a family of explicit MDPs, each with state space equal
to the blocks of the partition, and transition probabilities from each block nearly identical
to those of the underlying states. To formalize and analyze such families we introduce the
new concept of a bounded parameter MDP (BMDP), an MDP in which the transition
probabilities and rewards are given not as point values but as closed intervals. In
Givan et al. [1997], we describe algorithms that operate on BMDPs to produce bounds
on value functions and thereby compute approximately optimal policies; we summarize
these methods here. The resulting bounds and policies apply to the original implicit MDP.
Bounded parameter MDPs generalize traditional (exact) MDPs and are related to constructs
found in work on aggregation methods for solving MDPs [Schweitzer, 1984, Schweitzer et al.,
1985, Bertsekas and Castanon, 1989]. Although BMDPs are introduced here to represent
approximate aggregations, they are interesting in their own right and are discussed in more
detail in [Givan et al., 1997]. The model reduction algorithms and bounded parameter
MDP solution methods can be combined to find approximately optimal solutions to large
factored MDPs, varying ε to trade time and space for solution quality.

The remainder of this paper is organized as follows. In Section 2, we give an overview of
the algorithms and representations in this paper and discuss how they fit together.
Section 3 reviews traditional and factored MDPs and describes the generalization to bounded
parameter MDPs. Section 4 describes an algorithm for ε-reducing an MDP to a (possibly)
smaller explicit BMDP (an MDP if ε = 0). Section 5 summarizes our methods for policy
selection in BMDPs, and addresses the applicability of the selected policies to any
MDP which ε-reduces to the analyzed BMDP. The remaining sections summarize preliminary
experimental results and discuss related work.

2  Overview 

Here we survey and relate the basic mathematical objects and operations defined later in
this paper. We start with a Markov decision process (MDP) M for which we would like to
compute an optimal or near-optimal policy. Figure 1.a depicts the MDP M as a directed
graph corresponding to the state-transition diagram, and its optimal policy as found by
traditional value iteration.

We assume that the state space for M (and hence the state-transition graph) is quite large.
We therefore assume that the states of M are encoded in terms of state variables which
represent aspects of the state; an assignment of values to all of the state variables
constitutes a complete description of a state. In this paper, we assume that the factored
representation is in the form of a Bayesian network, such as that depicted in Figure 1.b
with four state variables {A, B, C, D}.

We speak about operations involving M, but in practice all operations will be performed
symbolically using the factored representation: we manipulate sets of states represented
as formulas involving the state variables.

Figure 1.c and Figure 1.d depict the unique smallest homogeneous partition of the state
space of M, where the blocks are represented (respectively) implicitly and explicitly. The
process of finding this partition is called (exact) model minimization. Factored model
minimization involves manipulating boolean formulas and is NP-hard, but heuristic
manipulation may rarely achieve this worst case.

The smallest homogeneous partition may be exponentially large, so we seek further
reduction (at a cost of only approximately optimal solutions) by finding a smaller
ε-homogeneous partition, depicted in Figure 1.e and Figure 1.f, where the blocks are again
represented (respectively) implicitly and explicitly.

Any ε-homogeneous partition can be used to create a bounded parameter MDP, shown in
Figure 1.g and notated as $M_\updownarrow$. To do this, we treat the partition blocks as
(aggregate) states and summarize everything that we know about transitions between blocks
in terms of closed real intervals that describe the variation within a block of the
transition probabilities to other blocks, i.e., for any action and pair of blocks, we record
the upper and lower bounds on the probability of starting in a state in one block and
ending up in the other block.¹
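
The interval construction just described can be sketched for an explicit (unfactored) MDP as follows. This is an illustrative sketch only: the paper works with factored representations, and the dictionary layout `F[a][p][q]`, the name `bmdp_intervals`, and the example numbers are assumptions introduced here.

```python
def bmdp_intervals(F, partition):
    """Lower and upper bounds on block-to-block transition probabilities.

    F[a][p][q] is the probability of moving from state p to state q under
    action a (missing entries mean probability 0); partition is a list of
    sets of states. The result maps (action, source index, target index)
    to a closed interval (lower, upper).
    """
    bounds = {}
    for a in F:
        for i, source in enumerate(partition):
            for j, target in enumerate(partition):
                # Probability of landing in the target block, one value per source state.
                probs = [sum(F[a][p].get(q, 0.0) for q in target) for p in source]
                bounds[(a, i, j)] = (min(probs), max(probs))
    return bounds


# Hypothetical example: states 0 and 1 are aggregated although they reach
# state 2 with slightly different probabilities.
F = {"go": {0: {0: 0.5, 2: 0.5}, 1: {1: 0.46, 2: 0.54}, 2: {2: 1.0}}}
partition = [{0, 1}, {2}]
bounds = bmdp_intervals(F, partition)
```

Here the aggregate state for block {0, 1} moves to block {2} with a probability known only to lie in [0.5, 0.54], which is exactly the kind of interval a BMDP records.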


¹The BMDP $M_\updownarrow$ naturally represents a family of MDPs,


Figure 1: The basic objects and operations described in this paper: (a) depicts the state-transition diagram
for an MDP M (only a single action is shown), (b) depicts a Bayesian network as an example of a symbolic
representation compactly encoding M, (c) and (d) depict the smallest homogeneous partition in (respectively) its
implicit (symbolic) and explicit forms, similarly, (e) and (f) depict an ε-homogeneous partition in its implicit and
explicit forms, (g) represents the bounded-parameter MDP summarizing the variations in the ε-homogeneous
partition, and, finally, (h), (i), and (j) depict particular (exact) MDPs from the family of MDPs defined by $M_\updownarrow$.


Our BMDP analysis algorithms extract particular MDPs from $M_\updownarrow$ that have
intuitive characterizations. The pessimistic model $M_{pes}$ is the MDP within
$M_\updownarrow$ which yields the lowest optimal value $V^*_{M_{pes}}$ at every state.
It is a theorem that $M_{pes}$ is well-defined, and that $V^*_{M_{pes}}$ at each state in
$M_\updownarrow$ is a lower bound for following the optimal policy $\pi^*_{M_{pes}}$ in
any MDP in $M_\updownarrow$ (as well as in the original M from any state in the
corresponding block). Similarly, the optimistic model $M_{opt}$ has the best value function
$V^*_{M_{opt}}$. $V^*_{M_{opt}}$ gives upper bounds for following any policy in M. In
summary, $V^*_{M_{pes}}$ and $V^*_{M_{opt}}$ give us lower and upper bounds on the optimal
value function we are really interested in, and following $\pi^*_{M_{pes}}$ in M is
guaranteed to achieve at least the lower bound.
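
To make the pessimistic and optimistic bounds concrete, the following sketch runs a value iteration over interval-valued transition probabilities. It is a hedged illustration rather than the authors' implementation: it assumes that the worst (or best) member of the family can be found at each backup by starting from the lower bounds and greedily giving the remaining probability mass to low-value (or high-value) states, and it assumes tight rewards. The list layout `lo[a][p][q]`, the function names, and the example numbers are all hypothetical.

```python
def extreme_distribution(lower, upper, values, pessimistic):
    """Pick a distribution inside [lower, upper] that minimizes (pessimistic)
    or maximizes (optimistic) the expected value of the next state."""
    order = sorted(range(len(values)), key=lambda q: values[q], reverse=not pessimistic)
    probs = list(lower)
    slack = 1.0 - sum(lower)  # mass still to be placed above the lower bounds
    for q in order:
        give = min(upper[q] - lower[q], slack)
        probs[q] += give
        slack -= give
    return probs


def interval_value_iteration(R, lo, hi, gamma, pessimistic, sweeps=500):
    """Approximate the optimal value function of the pessimistic or optimistic model."""
    n = len(R)
    V = [0.0] * n
    for _ in range(sweeps):
        new_V = []
        for p in range(n):
            best = max(
                sum(pr * v for pr, v in zip(
                    extreme_distribution(lo[a][p], hi[a][p], V, pessimistic), V))
                for a in range(len(lo))
            )
            new_V.append(R[p] + gamma * best)
        V = new_V
    return V


# Hypothetical two-state BMDP with a single action; state 1 is absorbing.
R = [0.0, 1.0]
lo = [[[0.2, 0.4], [0.0, 1.0]]]
hi = [[[0.6, 0.8], [0.0, 1.0]]]
v_pes = interval_value_iteration(R, lo, hi, gamma=0.5, pessimistic=True)
v_opt = interval_value_iteration(R, lo, hi, gamma=0.5, pessimistic=False)
```

With these numbers the value of state 0 is bracketed between 4/7 and 8/9, while the absorbing state 1 has value 2 in both models.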

Now,  armed  with  this  high-level  overview  to  serve  as 
a  road  map,  we  descend  into  the  details. 

3  Markov  Decision  Processes 

Exact Markov Decision Processes  An (exact) Markov decision process M is a four tuple
M = (Q, A, F, R) where Q is a set of states, A is a set of actions, R is a reward function
that maps each state to a real value R(q),² and F assigns a probability to each state
transition for each action, so that for a ∈ A and p, q ∈ Q,

$$F_{pq}(a) = \Pr(X_{t+1} = q \mid X_t = p,\ U_t = a)$$

where $X_t$ and $U_t$ are random variables denoting, respectively, the state and action
at time t.

A policy is a mapping from states to actions, π : Q → A. The value function $V_{\pi,M}$
for a given policy maps states to their expected discounted cumulative reward given that
you start in that state and act according to the given policy:

$$V_{\pi,M}(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, V_{\pi,M}(q)$$

where γ is the discount rate, 0 ≤ γ < 1 [Puterman, 1994].
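
The fixed-point equation for the value of a policy can be solved by repeated substitution. The sketch below does this for a small explicit MDP; the dictionary layout `F[a][p][q]`, the name `evaluate_policy`, the fixed number of sweeps, and the example are assumptions made for illustration.

```python
def evaluate_policy(F, R, policy, gamma, sweeps=1000):
    """Iterate V(p) = R(p) + gamma * sum_q F[policy[p]][p][q] * V(q).

    F[a][p][q] is a transition probability (missing entries mean 0),
    R[p] a state reward, policy[p] the action chosen in state p.
    """
    V = {p: 0.0 for p in R}
    for _ in range(sweeps):
        V = {
            p: R[p] + gamma * sum(pr * V[q] for q, pr in F[policy[p]][p].items())
            for p in R
        }
    return V


# Hypothetical example: state 0 moves to state 1, which is absorbing and pays 1.
F = {"go": {0: {1: 1.0}, 1: {1: 1.0}}}
R = {0: 0.0, 1: 1.0}
policy = {0: "go", 1: "go"}
V = evaluate_policy(F, R, policy, gamma=0.9)
```

For this example the absorbing state has value 1 / (1 - 0.9) = 10, and state 0 has value 0.9 × 10 = 9.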

Bounded Parameter MDPs  A bounded parameter MDP (BMDP) is a four tuple
$M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ where Q and A are as for MDPs,
and $F_\updownarrow$ and $R_\updownarrow$ are analogous to the MDP F and R but yield
closed real intervals instead of real values. That is, for any action a and states p, q,
$R_\updownarrow(p)$ and $F_{\updownarrow,pq}(a)$ are both closed real intervals of the form
[l, u] for l and u real numbers with 0 ≤ l ≤ u ≤ 1. For convenience, we define $\underline{F}$

but note that the original M is not generally in this family.
Nevertheless, our BMDP algorithms compute policies and
value bounds which can be soundly applied to the original
M.

²The techniques and results in this paper easily generalize
to more general reward functions. We adopt a less
general formulation to simplify the presentation.


and $\overline{F}$ to be real valued functions which give the lower and upper bounds of
the intervals; likewise for $\underline{R}$ and $\overline{R}$.³ To ensure that
$F_\updownarrow$ admits well-formed transition functions, we require that, for any action
a and state p,

$$\sum_{q \in Q} \underline{F}_{pq}(a) \;\le\; 1 \;\le\; \sum_{q \in Q} \overline{F}_{pq}(a).$$

A BMDP $M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ defines a set of exact
MDPs $\mathcal{F}_{M_\updownarrow} = \{M \mid M \models M_\updownarrow\}$ where
$M \models M_\updownarrow$ iff M = (Q, A, F, R) and F and R satisfy the bounds
provided by $F_\updownarrow$ and $R_\updownarrow$ respectively. We will write
of bounding the (optimal or policy specific) value of a state in a BMDP; by this we mean
providing an upper or lower bound on the corresponding state value over the entire family
of MDPs $\mathcal{F}_{M_\updownarrow}$. For a more thorough treatment of BMDPs, please
see [Givan et al., 1997].
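
The two conditions above, well-formedness of the interval bounds and membership of an exact MDP in the family, can be checked directly. The sketch below is a minimal illustration assuming tight rewards (so only transition bounds are checked); the nested-dictionary layout `lo[a][p][q]` / `hi[a][p][q]`, the function names, and the numbers are hypothetical.

```python
def is_well_formed(lo, hi, tol=1e-9):
    """Check 0 <= lo <= hi <= 1 entrywise and sum(lo) <= 1 <= sum(hi) per row."""
    for a in lo:
        for p in lo[a]:
            if any(not (0.0 <= lo[a][p][q] <= hi[a][p][q] <= 1.0) for q in lo[a][p]):
                return False
            if sum(lo[a][p].values()) > 1.0 + tol or sum(hi[a][p].values()) < 1.0 - tol:
                return False
    return True


def is_member(F, lo, hi, tol=1e-9):
    """Check that every transition probability of F lies inside its interval."""
    return all(
        lo[a][p][q] - tol <= F[a][p].get(q, 0.0) <= hi[a][p][q] + tol
        for a in lo for p in lo[a] for q in lo[a][p]
    )


# Hypothetical two-state BMDP with one action.
lo = {"go": {0: {0: 0.46, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
hi = {"go": {0: {0: 0.5, 1: 0.54}, 1: {0: 0.0, 1: 1.0}}}
inside = {"go": {0: {0: 0.48, 1: 0.52}, 1: {1: 1.0}}}
outside = {"go": {0: {0: 0.3, 1: 0.7}, 1: {1: 1.0}}}
```

The first exact MDP respects every interval and therefore belongs to the family; the second violates the bounds for state 0.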

Factored Representations  In the remainder of this paper, we make use of Bayesian
networks [Pearl, 1988] to encode implicit (or factored) representations; however, our
methods apply to other factored representations such as probabilistic STRIPS
operators [Kushmerick et al., 1995]. Let $X = \{X_1, \ldots, X_m\}$ be a set of state
variables. We assume the variables are boolean, and refer to them also as fluents. We
represent the state at time t as a vector $X_t = (X_{1,t}, \ldots, X_{m,t})$ where
$X_{i,t}$ denotes the value of the ith state variable at time t.

The state transition probabilities can be represented using Bayes networks.

A two-stage temporal Bayesian network (2TBN) is a directed acyclic graph consisting of
two sets of variables $\{X_{i,t}\}$ and $\{X_{i,t+1}\}$ in which directed arcs indicating
dependence are allowed from the variables in the first set to variables in the second set
and between variables in the second set [Dean and Kanazawa, 1989]. The state-transition
probabilities are now factored as

$$\Pr(X_{t+1} \mid X_t, U_t) = \prod_{i=1}^{m} \Pr(X_{i,t+1} \mid \mathrm{Parents}(X_{i,t+1}), U_t)$$

where Parents(X) denotes the parents of X in the 2TBN and each of the conditional
probability distributions $\Pr(X_{i,t+1} \mid \mathrm{Parents}(X_{i,t+1}), U_t)$ can be
represented as a conditional probability table or as a decision tree; we choose the latter
in this paper following [Boutilier et al., 1995b]. We enhance the 2TBN representation to
include actions and reward functions; the resulting graph is called an influence
diagram [Howard and Matheson, 1984].

Figure 2 illustrates a factored representation with three state variables, X = {P, Q, S},
and describes the transition probabilities and rewards for a particular action. The
factored form of the transition probabilities


³To simplify the remainder of the paper, we assume
that the reward bounds are always tight, i.e., that $\underline{R} = \overline{R}$.
The generalization to nontrivial bounds on rewards is
straightforward.


Figure 2: A factored representation with three state
variables, P, Q and S, and reward function R.

Figure 3: Two ε-homogeneous partitions for the MDP
described in Figure 2: (a) the smallest exact homogeneous
partition (ε = 0) and (b) a smaller partition for
ε = 0.05.

is

$$\Pr(X_{t+1} \mid X_t, U_t) = \Pr(P_{t+1} \mid P_t, Q_t)\,\Pr(Q_{t+1})\,\Pr(S_{t+1} \mid S_t, Q_t)$$

where in this case $X_t = (P_t, Q_t, S_t)$.
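
The factored form for the three-variable example can be evaluated by multiplying one local probability per variable. Because the conditional probability values of Figure 2 are not reproduced in the text, the numbers below are invented purely for illustration, as are the names `pr_P_true`, `pr_Q_true`, `pr_S_true`, and `transition_prob`; only the dependency structure follows the equation above.

```python
from itertools import product

# Hypothetical local models: probability that each variable is True at time t+1.
pr_P_true = {(True, True): 0.9, (True, False): 0.6,    # keyed by (P_t, Q_t)
             (False, True): 0.3, (False, False): 0.1}
pr_Q_true = 0.5                                        # Q_{t+1} has no parents here
pr_S_true = {(True, True): 0.8, (True, False): 0.7,    # keyed by (S_t, Q_t)
             (False, True): 0.2, (False, False): 0.05}


def bernoulli(p_true, value):
    """Probability that a boolean variable takes `value` when Pr(True) = p_true."""
    return p_true if value else 1.0 - p_true


def transition_prob(state, next_state):
    """Pr(X_{t+1} | X_t) as the product of the three local factors."""
    P, Q, S = state
    P2, Q2, S2 = next_state
    return (bernoulli(pr_P_true[(P, Q)], P2)
            * bernoulli(pr_Q_true, Q2)
            * bernoulli(pr_S_true[(S, Q)], S2))


states = list(product([True, False], repeat=3))
row_sums = [sum(transition_prob(s, s2) for s2 in states) for s in states]
```

Three small tables with nine numbers in total stand in for an explicit 8 × 8 transition matrix, and each row of the implied matrix still sums to one.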

4  Model  Reduction  Methods 

In this section, we describe a family of algorithms that take as input an MDP and a real
value ε between 0 and 1 and compute a bounded parameter MDP where each closed real
interval has extent less than or equal to ε. The states in this BMDP correspond to the
blocks of a partition of the state space in which states in the same block behave
approximately the same with respect to the other blocks. The upper and lower bounds in the
BMDP correspond to bounds on the transition probabilities (to other blocks) for states that
are grouped together.

We first define the property sought in the desired state space partition. Let
$\mathcal{P} = \{B_1, \ldots, B_n\}$ be a partition of Q.

Definition 1  A partition $\mathcal{P} = \{B_1, \ldots, B_n\}$ of the state space of an
MDP M has the property of ε-approximate stochastic bisimulation homogeneity with
respect to M for ε such that 0 ≤ ε ≤ 1 if and only if for each $B_i, B_j \in \mathcal{P}$,
for each $a \in A$, for each $p, q \in B_i$,

$$|R(p) - R(q)| \le \epsilon, \quad\text{and}\quad \Bigl|\sum_{r \in B_j} F_{pr}(a) - \sum_{r \in B_j} F_{qr}(a)\Bigr| \le \epsilon.$$

For conciseness, we say $\mathcal{P}$ is ε-homogeneous.⁴
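
Definition 1 can be checked directly on a small explicit MDP by comparing, within each block, the spread of rewards and the spread of block-transition probabilities against ε. The sketch below assumes a hypothetical dictionary layout `F[a][p][q]` and function name `is_eps_homogeneous`; it enumerates states explicitly, which the factored algorithm of this paper is designed to avoid.

```python
def is_eps_homogeneous(F, R, partition, eps):
    """Check the two conditions of Definition 1 for every block of the partition.

    F[a][p][q] is a transition probability (missing entries mean 0), R[p] a
    reward, partition a list of sets of states.
    """
    for block in partition:
        rewards = [R[p] for p in block]
        if max(rewards) - min(rewards) > eps:
            return False
        for a in F:
            for target in partition:
                # One block-transition probability per state of the source block.
                probs = [sum(F[a][p].get(r, 0.0) for r in target) for p in block]
                if max(probs) - min(probs) > eps:
                    return False
    return True


# Hypothetical example: states 0 and 1 reach state 2 with probability 0.5 and 0.54.
F = {"go": {0: {0: 0.5, 2: 0.5}, 1: {1: 0.46, 2: 0.54}, 2: {2: 1.0}}}
R = {0: 0.0, 1: 0.0, 2: 1.0}
coarse = [{0, 1}, {2}]
fine = [{0}, {1}, {2}]
```

Grouping states 0 and 1 is acceptable for ε = 0.05 but not for ε = 0.01, while the partition into singletons is homogeneous for any ε, including ε = 0.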

Figure 3 shows two ε-homogeneous partitions for the MDP described in Figure 2.

We now explain how we construct an ε-homogeneous partition. We first describe the
relationship between every ε-homogeneous partition and a particular simple partition based
on immediate reward.

Definition 2  A partition $\mathcal{P}'$ is a refinement of a partition $\mathcal{P}$ if
and only if each block of $\mathcal{P}'$ is a subset of some block of $\mathcal{P}$; in
this case, we say that $\mathcal{P}$ is coarser than $\mathcal{P}'$, and is a clustering
of $\mathcal{P}'$.

Definition  3  The  immediate  reward  partition  is  the 
partition  in  which  two  states,  p  and  q,  are  in  the  same 
block  if  and  only  if  they  have  the  same  reward. 

Definition 4  A partition $\mathcal{P}$ is ε-uniform with respect to a function
$f : Q \to \mathbb{R}$ if for every two states p and q in the same block of $\mathcal{P}$,
$|f(p) - f(q)| \le \epsilon$.

Every ε-homogeneous partition is a refinement of some ε-uniform clustering (with respect
to reward) of the immediate reward partition. Our algorithm starts by constructing an
ε-uniform reward clustering $\mathcal{P}_0$ of the immediate reward partition.⁵ We then
refine this initial partition by splitting⁶ blocks repeatedly to achieve ε-homogeneity. We
can decide which blocks are candidates for splitting using the following local property
of the blocks of an ε-homogeneous partition:

Definition 5  We say that a block C of a partition $\mathcal{P}$ is ε-stable with respect
to a block B iff for all actions a and all states p ∈ C and q ∈ C we have

$$\Bigl|\sum_{r \in B} F_{pr}(a) - \sum_{r \in B} F_{qr}(a)\Bigr| \le \epsilon.$$

We say that C is ε-stable if C is ε-stable with respect to every block of $\mathcal{P}$
and action in A.

The definitions immediately imply that a partition is ε-homogeneous iff every block in
the partition is ε-stable.

The model ε-reduction algorithm simply checks each block for ε-stability, splitting
unstable blocks until quiescence, i.e., until there are no unstable blocks left to split.
Specifically, when a block C is found to be unstable with respect to a block B, we replace
C in the partition by a set⁷ of sub-blocks $C_1, \ldots, C_k$ such that each
⁴For the case of ε = 0, ε-approximate stochastic bisimulation
homogeneity is closely related to the substitution
property for finite automata developed by Hartmanis and
Stearns [1966] and the notion of lumpability for Markov
chains [Kemeny and Snell, 1960].

⁵There may be many such clusterings; we currently
choose a coarsest one arbitrarily.

⁶The term splitting refers to the process whereby a block
of a partition is divided into two or more sub-blocks to
obtain a refinement of the original partition.

⁷There may be more than one choice, as discussed
below.

Figure 4: Clustering sub-blocks that behave approximately
the same. With ε = 0.01 there are two smallest
clusterings.


$C_i$ is a maximal sub-block of C that is ε-stable with respect to B. Note that at all
times the blocks of the partition are represented in factored form, e.g., as DNF
formulas over the state variables. The block splitting operation manipulates these factored
representations, not explicit states. This method is an extension to Markov decision
processes of the deterministic model reduction algorithm of Lee and Yannakakis [1992].

If ε = 0, the above description fully defines the block splitting operation, as there
exists a unique set of maximal, stable sub-blocks. Furthermore, in this case, the algorithm
finds the unique smallest homogeneous partition, independent of the order in which
unstable blocks are split. We call this partition the minimal model (we also use this term
to refer to the MDP derived from this partition by treating its blocks as states).

However, if $\epsilon > 0$, then we may have to choose among several
possible ways of splitting $C$, as shown in the following example. Figure 4
depicts a block, $C$, and two other blocks, $B$ and $B'$, such that states
in $C$ transition to states in $B$ and $B'$ under some action $a$. We
partition $C$ into three sub-blocks $\{C_1, C_2, C_3\}$ such that states in
each sub-block have the same transition probabilities with respect to $a$,
$B$, and $B'$. In building a 0.01-approximate model, we might replace $C$ by
the two blocks $C_1$ and $C_2 \cup C_3$, or by the two blocks $C_3$ and
$C_1 \cup C_2$; it is possible to construct examples in which each of these
is the most appropriate choice because of the splits of other blocks
induced later.* We require only that the clustering selected is not the
refinement of another $\epsilon$-uniform clustering, i.e., that it is as
coarse as possible.
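To make the choice in the Figure 4 example concrete, here is a minimal Python sketch of one greedy clustering rule. It assumes (the text above does not spell this out) that each sub-block is summarized by a single probability of moving into $B$ under action $a$, and that sub-blocks may share a cluster when those probabilities differ by at most $\epsilon$. The function name `greedy_cluster` and the probabilities 0.50, 0.51 and 0.52 are hypothetical; they are not the values from the figure.

```python
from fractions import Fraction


def greedy_cluster(sub_blocks, eps):
    """Greedily group sub-blocks whose probabilities lie within eps of each other.

    sub_blocks: dict mapping a sub-block name to its probability of moving
                into block B (one number per sub-block; an assumed simplification).
    eps:        the tolerance used when merging sub-blocks.
    """
    clusters = []  # each cluster is a list of (name, probability) pairs
    # Sweep from the smallest probability upward.
    for name, prob in sorted(sub_blocks.items(), key=lambda kv: kv[1]):
        # Join the current cluster only if we stay within eps of its smallest member.
        if clusters and prob - clusters[-1][0][1] <= eps:
            clusters[-1].append((name, prob))
        else:
            clusters.append([(name, prob)])
    return [[name for name, _ in cluster] for cluster in clusters]


# Hypothetical probabilities; exact fractions avoid floating-point edge cases.
example = {
    "C1": Fraction("0.50"),
    "C2": Fraction("0.51"),
    "C3": Fraction("0.52"),
}
result = greedy_cluster(example, Fraction("0.01"))
```

With these made-up numbers the upward sweep groups C1 with C2 and leaves C3 alone; a downward sweep would instead group C2 with C3. Both clusterings are as coarse as possible, which is the ambiguity the example describes.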

Because we make the clustering decisions arbitrarily, our algorithm does
not guarantee finding the smallest $\epsilon$-homogeneous partition when
$\epsilon > 0$, nor that the partition found for $\epsilon_1$ will be
smaller than (or even as small as) the partition found for
$\epsilon_2 < \epsilon_1$. However, it is a theorem that the partition
found will be no larger than the unique smallest 0-homogeneous partition.

*The result is additionally sensitive to the order in which unstable
blocks are split: splitting one $\epsilon$-unstable block may make another
become $\epsilon$-stable.

Theorem 1 For $\epsilon > 0$, the partition found by model
$\epsilon$-reduction using any clustering technique is coarser than, and
thus no larger than, the minimal model.

Theorem 2 For $0 < \epsilon_2 < \epsilon_1$, the smallest
$\epsilon_1$-homogeneous partition is no larger than the smallest
$\epsilon_2$-homogeneous partition. The model $\epsilon$-reduction
algorithm, augmented by an (impractical) search over all clustering
decisions, will find these smallest partitions.

Theorem 3 Given a bound and an MDP whose smallest $\epsilon$-homogeneous
partition is polynomial in size, the problem of determining whether there
exists an $\epsilon$-homogeneous partition of size no more than the bound
is NP-complete.

These theorems imply that using $\epsilon > 0$ can only help us, but that
our methods may be sensitive to just which $\epsilon$ we choose, and are
necessarily heuristic.

Currently our implementation uses a greedy clustering algorithm; in the
future we hope to incorporate more sophisticated techniques from the
learning and pattern recognition literature to find a smaller clustering
locally within each SPLIT operation (though this does not guarantee a
smaller final partition).

Each $\epsilon$-homogeneous partition $P$ of an MDP $M = (Q, A, F, R)$
induces a corresponding BMDP $M_P = (\hat{Q}, \hat{A}, \hat{F}, \hat{R})$
in a straightforward manner. The states of $M_P$ are just the blocks of $P$
and the actions are the same as those in $M$. The reward and transition
functions are defined to give intervals bounding the possible reward and
block transition probabilities within each block: for blocks $B$ and $C$
and action $a$,

$$\hat{R}(B) = \left[ \min_{p \in B} R(p),\ \max_{p \in B} R(p) \right]$$

$$\hat{F}_{B,C}(a) = \left[ \min_{p \in B} \sum_{q \in C} F_{pq}(a),\ \max_{p \in B} \sum_{q \in C} F_{pq}(a) \right]$$
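The two interval definitions above can be sketched in a few lines of Python. This is only an illustration on explicit state lists, whereas the paper manipulates factored block representations; the function name `induced_intervals`, the dictionary layout and the toy numbers are assumptions made for the example.

```python
def induced_intervals(partition, F, R, action):
    """Compute interval rewards and block-transition intervals for one action.

    partition: dict mapping a block name to the list of states in that block
    F:         dict mapping (p, action, q) to a transition probability
               (missing entries are treated as probability 0)
    R:         dict mapping each state to its reward
    """
    # Reward interval of a block: smallest and largest reward among its states.
    reward = {
        block: (min(R[p] for p in states), max(R[p] for p in states))
        for block, states in partition.items()
    }
    trans = {}
    for b_name, b_states in partition.items():
        for c_name, c_states in partition.items():
            # Probability mass each state p in B sends into block C.
            mass = [
                sum(F.get((p, action, q), 0.0) for q in c_states)
                for p in b_states
            ]
            trans[(b_name, c_name)] = (min(mass), max(mass))
    return reward, trans


# Toy MDP (hypothetical numbers): block B holds s1 and s2, block C holds s3.
partition = {"B": ["s1", "s2"], "C": ["s3"]}
F = {
    ("s1", "a", "s1"): 0.5, ("s1", "a", "s3"): 0.5,
    ("s2", "a", "s2"): 0.25, ("s2", "a", "s3"): 0.75,
    ("s3", "a", "s3"): 1.0,
}
R = {"s1": 1.0, "s2": 2.0, "s3": 0.0}
reward, trans = induced_intervals(partition, F, R, "a")
```

In this toy case the states of block B move into block C with probability 0.5 or 0.75, so the induced interval for that block transition is [0.5, 0.75].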

We can then use the methods in the next section to give intervals bounding
the optimal value of each state in $M_P$ and select a policy which
guarantees achieving at least the lower bound value at each state. The
following theorem then implies the value bounds apply to the states in $M$,
and are achieved or exceeded by following the corresponding policy in $M$.

We first note that any function on the blocks of $P$ can be extended to a
function on the states of $M$: for each state we return the value assigned
to the block of $P$ in which it falls. In this manner, we can interpret the
value bounds and policies for $M_P$ as bounds and policies for $M$.

Theorem 4 For any MDP $M$ and $\epsilon$-homogeneous partition $P$ of the
states of $M$, sound (optimal or policy specific) value bounds for $M_P$
apply also to $M$ (by extending the policy and value functions to the state
space of $M$ according to $P$).

5  Interval  Value  Iteration 

We have developed a variant of the value iteration algorithm for computing
the optimal policy for exact MDPs [Bellman, 1957] that operates on bounded
parameter MDPs. A BMDP $M$ represents a family of MDPs $\mathcal{F}_M$,
implying some degree of uncertainty as to which MDP in the family actions
will actually be taken in. As such, there is no specific value for
following a policy from a start state; rather, there is a window of
possible values for following the policy in the different MDPs of the
family. Similarly, for each state there is a window of possible optimal
values over the MDPs in the family $\mathcal{F}_M$. Our algorithm can
compute bounds on policy specific value functions as well as bounds on the
optimal value function. We have also shown how to extract from these bounds
a specific "optimal" policy which is guaranteed to achieve at least the
lower bound value in any actual MDP from the family $\mathcal{F}_M$ defined
by the BMDP. We call this policy $\pi_{pes}$, the pessimistic optimal
policy.

We call this algorithm interval value iteration ($IVI$ for optimal values,
and $IVI_\pi$ for policy specific values). The algorithm is based on the
fact that, if we only knew the rank ordering of the states' values, we
would easily be able to select an MDP from the family $\mathcal{F}_M$ which
minimized or maximized those values, and then compute the values using that
MDP. Since we don't know the rank ordering of states' values, the algorithm
uses the ordering of the current estimates of the values to select a
minimizing (maximizing) MDP from the family, and performs one iteration of
standard value iteration on that MDP to get new value estimates. These new
estimates can then be used to select a new minimizing (maximizing) MDP for
the next iteration, and so forth.
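The order-based selection step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes transition bounds are stored as `(lower, upper)` pairs that admit at least one probability distribution, and the names `extreme_distribution` and `ivi_backup` are hypothetical. The idea is to start every successor at its lower bound and hand out the remaining probability mass in order of the current value estimates, lowest first when minimizing and highest first when maximizing.

```python
def extreme_distribution(intervals, values, minimize=True):
    """Choose probabilities inside the bounds that minimize (or maximize)
    the expected value under the current value estimates.

    intervals: dict mapping next state q to (lower, upper) probability bounds
    values:    dict mapping each state to its current value estimate
    """
    # Visit successors from worst to best when minimizing, best to worst otherwise.
    order = sorted(intervals, key=lambda q: values[q], reverse=not minimize)
    probs = {q: intervals[q][0] for q in intervals}  # start at the lower bounds
    slack = 1.0 - sum(probs.values())                # mass still to distribute
    for q in order:
        extra = min(intervals[q][1] - intervals[q][0], slack)
        probs[q] += extra
        slack -= extra
    return probs


def ivi_backup(states, actions, F, R, values, gamma, minimize=True):
    """One backup: select the extreme MDP for the current estimates, then
    apply a standard value iteration step (max over actions) on it."""
    new_values = {}
    for p in states:
        best = None
        for a in actions:
            probs = extreme_distribution(F[(p, a)], values, minimize)
            total = R[p] + gamma * sum(probs[q] * values[q] for q in probs)
            best = total if best is None else max(best, total)
        new_values[p] = best
    return new_values


# Toy BMDP with one action (hypothetical numbers).
states, actions = ["x", "y"], ["a"]
F = {
    ("x", "a"): {"x": (0.25, 0.75), "y": (0.25, 0.75)},
    ("y", "a"): {"y": (1.0, 1.0)},
}
R = {"x": 1.0, "y": 2.0}
estimates = {"x": 0.0, "y": 4.0}
lower = ivi_backup(states, actions, F, R, estimates, 0.5, minimize=True)
upper = ivi_backup(states, actions, F, R, estimates, 0.5, minimize=False)
```

Repeating `ivi_backup` with `minimize=True` and feeding each result back in as the new estimates gives the lower-bound iteration; `minimize=False` gives the upper-bound iteration.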

Bounded parameter MDPs are interesting objects and we explore them at
greater length in [Givan et al., 1997]. In that paper, we prove the
following results about $IVI$.

Theorem 5 Given a BMDP $M$ and a specific policy $\pi$, $IVI_\pi$
converges at each state to lower and upper bounds on the value of $\pi$ at
that state over all the MDPs in $\mathcal{F}_M$.

Theorem 6 Given a BMDP $M$, $IVI$ converges at each state to lower and
upper bounds on the optimal value of that state over all the MDPs in
$\mathcal{F}_M$.

Theorem 7 Given a BMDP $M$, the policy $\pi_{pes}$ extracted by assuming
that states' actual values are the $IVI$-converged lower bounds has a
policy specific lower bound (from $IVI_\pi$) in $M$ equal to the (non
policy specific) $IVI$-converged lower bound. No other policy has a higher
policy specific lower bound.

6  Related  Work  and  Discussion 

This paper combines a number of techniques to address the problem of
solving (factored) MDPs with very large state spaces. The definition of
$\epsilon$-homogeneity and the model reduction algorithms for finding
$\epsilon$-homogeneous partitions are new, but draw on techniques from
automata theory and symbolic model checking. Burch et al. [1994] is the
standard reference on symbolic model checking for computer-aided design.
Our reduction algorithm and its analysis were motivated by the work of Lee
and Yannakakis [1992] and Bouajjani et al. [1992].

The notion of bounded-parameter MDP is also new, but is related to
aggregation techniques used to speed convergence in iterative algorithms
for solving exact MDPs. Bertsekas and Castanon [1989] use the notion of
aggregated Markov chains and consider grouping together states with
approximately the same residuals (i.e., the difference in the estimated
value function from one iteration to the next during value iteration).

The methods for manipulating factored representations of MDPs were largely
borrowed from Boutilier et al. [1995b], which provides an iterative
algorithm for finding optimal solutions to factored MDPs. Dean and Givan
[1997] describe a model-minimization algorithm for solving factored MDPs
which is asymptotically equivalent to the algorithm in [Boutilier et al.,
1995b].

Boutilier and Dearden [1994] extend the work in [Boutilier et al., 1995b]
to compute approximate solutions to factored MDPs by associating upper and
lower bounds with symbolically represented blocks of states. States are
aggregated if they have approximately the same value rather than if they
behave approximately the same under all or some set of policies, though it
often turns out that states with nearly the same value have nearly the same
dynamics.

There are two significant differences between our approximation techniques
and those of Boutilier and Dearden. First, we partition the state space and
then perform interval value iteration on the resulting bounded-parameter
MDP, while Boutilier and Dearden repeatedly partition the state space.
Second, we use a fixed $\epsilon$ for computing a partition while Boutilier
and Dearden, like Bertsekas and Castanon, repartition the state space (if
necessary) on each iteration on the basis of the current residuals, and,
hence, (effectively) they use different $\epsilon$'s at different times and
on different portions of the state space. Despite these differences, we
conjecture that the two algorithms perform asymptotically the same.
Practically speaking, we expect that in some cases, repeatedly and
adaptively computing partitions may provide better performance, while in
other cases, performing the partition once and for all may result in a
computational advantage.

We have written a prototype implementation of the model reduction
algorithms described in this paper, along with the BMDP evaluation
algorithms (IVI) referred to. Using this implementation we have been able
to demonstrate substantial reductions in model size, and increasing
reductions with increasing $\epsilon$. However, the MDPs we have been
reducing are still "toy" problems and while they were not concocted
expressly to make the algorithm look good, these empirical results are
still of questionable value. Further research is necessary before these
techniques are adequate to handle a real-world large scale planning problem
in order to give convincing empirical data.

Finally, we believe that by formalizing the notions of approximately
similar behavior, approximately equivalent models, and families of closely
related MDPs, the mathematical entities corresponding to
$\epsilon$-homogeneous partitions, $\epsilon$-reductions, and
bounded-parameter MDPs provide valuable insight into factored MDPs and the
prospects for solving them efficiently.

References 

[Bellman, 1957] Bellman, Richard 1957. Dynamic Programming. Princeton
University Press.

[Bertsekas and Castanon, 1989] Bertsekas, D. P. and Castanon, D. A. 1989.
Adaptive aggregation for infinite horizon dynamic programming. IEEE
Transactions on Automatic Control 34(6):589-598.

[Bouajjani et al., 1992] Bouajjani, A.; Fernandez, J.-C.; Halbwachs, N.;
Raymond, P.; and Ratel, C. 1992. Minimal state graph generation. Science of
Computer Programming 18:247-269.

[Boutilier and Dearden, 1994] Boutilier, Craig and Dearden, Richard 1994.
Using abstractions for decision theoretic planning with time constraints.
In Proceedings AAAI-94. AAAI. 1016-1022.

[Boutilier et al., 1995a] Boutilier, Craig; Dean, Thomas; and Hanks, Steve
1995a. Planning under uncertainty: Structural assumptions and computational
leverage. In Proceedings of the Third European Workshop on Planning.

[Boutilier et al., 1995b] Boutilier, Craig; Dearden, Richard; and
Goldszmidt, Moises 1995b. Exploiting structure in policy construction. In
Proceedings IJCAI 14. IJCAII. 1104-1111.

[Burch et al., 1994] Burch, Jerry; Clarke, Edmund M.; Long, David;
McMillan, Kenneth L.; and Dill, David L. 1994. Symbolic model checking for
sequential circuit verification. IEEE Transactions on Computer Aided Design
13(4):401-424.

[Dean and Givan, 1997] Dean, Thomas and Givan, Robert 1997. Model
minimization in Markov decision processes. In Proceedings AAAI-97. AAAI.

[Dean and Kanazawa, 1989] Dean, Thomas and Kanazawa, Keiji 1989. A model
for reasoning about persistence and causation. Computational Intelligence
5(3):142-150.

[Dean et al., 1995] Dean, Thomas; Kaelbling, Leslie; Kirman, Jak; and
Nicholson, Ann 1995. Planning under time constraints in stochastic domains.
Artificial Intelligence 76(1-2):35-74.

[Givan et al., 1997] Givan, Robert; Leach, Sonia; and Dean, Thomas 1997.
Bounded parameter Markov decision processes. Technical Report CS-97-05,
Brown University, Providence, Rhode Island.

[Hartmanis and Stearns, 1966] Hartmanis, J. and Stearns, R. E. 1966.
Algebraic Structure Theory of Sequential Machines. Prentice-Hall, Englewood
Cliffs, N.J.

[Howard and Matheson, 1984] Howard, Ronald A. and Matheson, James E. 1984.
Influence diagrams. In Howard, Ronald A. and Matheson, James E., editors
1984, The Principles and Applications of Decision Analysis. Strategic
Decisions Group, Menlo Park, CA 94025.

[Howard, 1960] Howard, Ronald A. 1960. Dynamic Programming and Markov
Processes. MIT Press, Cambridge, Massachusetts.

[Kemeny and Snell, 1960] Kemeny, J. G. and Snell, J. L. 1960. Finite Markov
Chains. D. Van Nostrand, New York.

[Kushmerick et al., 1995] Kushmerick, Nicholas; Hanks, Steve; and Weld,
Daniel 1995. An algorithm for probabilistic planning. Artificial
Intelligence 76(1-2).

[Lee and Yannakakis, 1992] Lee, David and Yannakakis, Mihalis 1992. Online
minimization of transition systems. In Proceedings of 24th Annual ACM
Symposium on the Theory of Computing.

[Lin and Dean, 1995] Lin, Shieu-Hong and Dean, Thomas 1995. Generating
optimal policies for high-level plans with conditional branches and loops.
In Proceedings of the Third European Workshop on Planning. 205-218.

[Pearl, 1988] Pearl, Judea 1988. Probabilistic Reasoning in Intelligent
Systems: Networks of Plausible Inference. Morgan Kaufmann, San Francisco,
California.

[Puterman, 1994] Puterman, Martin L. 1994. Markov Decision Processes. John
Wiley & Sons, New York.

[Schweitzer et al., 1985] Schweitzer, Paul J.; Puterman, Martin L.; and
Kindle, Kyle W. 1985. Iterative aggregation-disaggregation procedures for
discounted semi-Markov reward processes. Operations Research 33(3):589-605.

[Schweitzer, 1984] Schweitzer, Paul J. 1984. Aggregation methods for large
Markov chains. In Iazeolla, G.; Courtois, P. J.; and Hordijk, A., editors
1984, Mathematical Computer Performance and Reliability. Elsevier,
Amsterdam, Holland. 275-302.




Bounded  Parameter  Markov  Decision  Processes 


Robert  Givan  and  Sonia  Leach  and  Thomas  Dean 


Department of Computer Science, Brown University
115  Waterman  Street,  Providence,  RI  02912,  USA 
http://www.cs.brown.edu/people/{rlg,sml,tld}
Phone: (401) 863-7600 Fax: (401) 863-7657
Email: {rlg,sml,tld}@cs.brown.edu


Abstract. In this paper, we introduce the notion of a bounded parameter
Markov decision process (BMDP) as a generalization of the familiar exact
MDP. A bounded parameter MDP is a set of exact MDPs specified by giving
upper and lower bounds on transition probabilities and rewards (all the
MDPs in the set share the same state and action space). BMDPs form an
efficiently solvable special case of the already known class of MDPs with
imprecise parameters (MDPIPs). Bounded parameter MDPs can be used to
represent variation or uncertainty concerning the parameters of sequential
decision problems in cases where no prior probabilities on the parameter
values are available. Bounded parameter MDPs can also be used in
aggregation schemes to represent the variation in the transition
probabilities for different base states aggregated together in the same
aggregate state.

We introduce interval value functions as a natural extension of
traditional value functions. An interval value function assigns a closed
real interval to each state, representing the assertion that the value of
that state falls within that interval. An interval value function can be
used to bound the performance of a policy over the set of exact MDPs
associated with a given bounded parameter MDP. We describe an iterative
dynamic programming algorithm called interval policy evaluation which
computes an interval value function for a given BMDP and specified policy.
Interval policy evaluation on a policy $\pi$ computes the most restrictive
interval value function that is sound, i.e., that bounds the value function
for $\pi$ in every exact MDP in the set defined by the bounded parameter
MDP. We define optimistic and pessimistic notions of optimal policy, and
provide a variant of value iteration [Bellman, 1957] that we call interval
value iteration which computes policies for a BMDP that are optimal in
these senses.


1  Introduction 

The theory of Markov decision processes (MDPs) provides the semantic
foundations for a wide range of problems involving planning under
uncertainty [Boutilier et al., 1995a, Littman, 1997]. In this paper, we
introduce a generalization of Markov decision processes called bounded
parameter Markov decision processes (BMDPs) that allows us to model
uncertainty in the parameters that comprise an MDP. Instead of encoding a
parameter such as the probability of making a transition from one state to
another as a single number, we specify a range of possible values for the
parameter as a closed interval of the real numbers.

A BMDP can be thought of as a family of traditional (exact) MDPs, i.e.,
the set of all MDPs whose parameters fall within the specified ranges. From
this perspective, we may have no justification for committing to a
particular MDP in this family, and wish to analyze the consequences of this
lack of commitment. Another interpretation for a BMDP is that the states of
the BMDP actually represent sets (aggregates) of more primitive states that
we choose to group together. The intervals here represent the ranges of the
parameters over the primitive states belonging to the aggregates. While any
policy on the original (primitive) states induces a stationary distribution
over those states which can be used to give prior probabilities to the
different transition probabilities in the intervals, we may be unable to
compute these prior probabilities: the original reason for aggregating the
states is typically to avoid such expensive computation over the original
large state space.

BMDPs are an efficiently solvable specialization of the already known
Markov Decision Processes with Imprecisely Known Transition Probabilities
(MDPIPs). In the related work section we discuss in more detail how BMDPs
relate to MDPIPs.

In a related paper, we have shown how BMDPs can be used as part of a
strategy for efficiently approximating the solution of MDPs with very large
state spaces and dynamics compactly encoded in a factored (or implicit)
representation [Dean et al., 1997]. In this paper, we focus exclusively on
BMDPs, on the BMDP analog of value functions, called interval value
functions, and on policy selection for a BMDP. We provide BMDP analogs of
the standard (exact) MDP algorithms for computing the value function for a
fixed policy (plan) and (more generally) for computing optimal value
functions over all policies, called interval policy evaluation and interval
value iteration (IVI) respectively. We define the desired output values for
these algorithms and prove that the algorithms converge to these desired
values in polynomial time, for a fixed discount factor. Finally, we
consider two different notions of optimal policy for a BMDP, and show how
IVI can be applied to extract the optimal policy for each notion. The first
notion of optimality states that the desired policy must perform better
than any other under the assumption that an adversary selects the model
parameters. The second notion requires the best possible performance when a
friendly choice of model parameters is assumed.

2  Exact  Markov  Decision  Processes 

An (exact) Markov decision process $M$ is a four-tuple $M = (Q, A, F, R)$
where $Q$ is a set of states, $A$ is a set of actions, $R$ is a reward
function that maps each state to a real value $R(q)$,^1 and $F$ is a
state-transition distribution so that for $a \in A$ and $p, q \in Q$

$$F_{pq}(a) = \Pr(X_{t+1} = q \mid X_t = p, U_t = a)$$

where $X_t$ and $U_t$ are random variables denoting, respectively, the
state and action at time $t$. When needed we will write $F^M$ to denote the
transition function of the MDP $M$.

^1 The techniques and results in this paper easily generalize to more
general reward functions. We adopt a less general formulation to simplify
the presentation.

A policy is a mapping from states to actions, $\pi : Q \to A$. The set of
all policies is denoted $\Pi$. An MDP $M$ together with a fixed policy
$\pi \in \Pi$ determines a Markov chain such that the probability of making
a transition from $p$ to $q$ is defined by $F_{pq}(\pi(p))$. The expected
value function (or simply the value function) associated with such a Markov
chain is denoted $V_{M,\pi}$. The value function maps each state to its
expected discounted cumulative reward defined by

$$V_{M,\pi}(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p)) \, V_{M,\pi}(q)$$

where $0 \le \gamma < 1$ is called the discount rate.^2 In most contexts,
the relevant MDP is clear and we abbreviate $V_{M,\pi}$ as $V_\pi$.

The optimal value function $V^*_M$ (or simply $V^*$ where the relevant MDP
is clear) is defined as follows.

$$V^*(p) = \max_{a \in A} \left( R(p) + \gamma \sum_{q \in Q} F_{pq}(a) \, V^*(q) \right)$$

The value function $V^*$ is greater than or equal to any value function
$V_\pi$ in the partial order $\ge_{dom}$ defined as follows:
$V_1 \ge_{dom} V_2$ if and only if for all states $q$,
$V_1(q) \ge V_2(q)$.

An optimal policy is any policy $\pi^*$ for which $V^* = V_{\pi^*}$. Every
MDP has at least one optimal policy, and the set of optimal policies can be
found by replacing the max in the definition of $V^*$ with arg max.

3  Bounded  Parameter  Markov  Decision  Processes 

A bounded parameter MDP is a four-tuple
$M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ where $Q$ and $A$
are defined as for MDPs, and $F_\updownarrow$ and $R_\updownarrow$ are
analogous to the MDP $F$ and $R$ but yield closed real intervals instead of
real values. That is, for any action $a$ and states $p, q$,
$R_\updownarrow(p)$ and $F_{\updownarrow p,q}(a)$ are both closed real
intervals of the form $[l, u]$ for $l$ and $u$ real numbers with
$l \le u$, where in the case of $F_\updownarrow$ we require
$0 \le l \le u \le 1$.^3 To ensure that $F_\updownarrow$ admits well-formed
transition functions, we require that for any action $a$ and state $p$, the
sum of the lower bounds of $F_{\updownarrow p,q}(a)$ over all states $q$
must be less than or equal to 1 while the upper bounds must sum to a value
greater than or equal to 1. Figure 1 depicts the state-transition diagram
for a simple BMDP with three states and one action.

Fig. 1. The state-transition diagram for a simple bounded parameter Markov
decision process with three states and a single action. The arcs indicate
possible transitions and are labeled by their lower and upper bounds.

^2 In this paper, we focus on expected discounted cumulative reward as a
performance criterion, but other criteria, e.g., total or average reward
[Puterman, 1994], are also applicable to bounded parameter MDPs.

^3 To simplify the remainder of the paper, we assume that the reward bounds
are always tight, i.e., that for all $q \in Q$, for some real $l$,
$R_\updownarrow(q) = [l, l]$, and we refer to $l$ as $R(q)$. The
generalization to nontrivial bounds on rewards is straightforward.
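The well-formedness requirement on the interval transition function is easy to check mechanically. The sketch below is illustrative only: the function name `is_well_formed`, the dictionary layout, the small tolerance for floating-point sums and the example bounds are all assumptions, not part of the paper.

```python
def is_well_formed(F_bounds, tol=1e-12):
    """Check that every (state, action) row of interval bounds admits
    at least one probability distribution.

    F_bounds: dict mapping (state, action) to a dict that maps each next
              state to a (lower, upper) pair of probability bounds.
    """
    for row in F_bounds.values():
        # Each interval must satisfy 0 <= lower <= upper <= 1.
        if not all(0.0 <= lo <= hi <= 1.0 for lo, hi in row.values()):
            return False
        lower_sum = sum(lo for lo, _ in row.values())
        upper_sum = sum(hi for _, hi in row.values())
        # Lower bounds may not exceed total mass 1; upper bounds must reach it.
        if lower_sum > 1.0 + tol or upper_sum < 1.0 - tol:
            return False
    return True


# Hypothetical bounds: the first row is fine, the second cannot sum to 1.
good = {("p", "a"): {"p": (0.25, 0.5), "q": (0.5, 0.75)}}
bad = {("p", "a"): {"p": (0.125, 0.25), "q": (0.125, 0.5)}}
```

In the `bad` example the upper bounds sum to only 0.75, so no choice of probabilities inside the intervals can add up to 1.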

A BMDP $M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ defines a
set of exact MDPs which, by abuse of notation, we also call
$M_\updownarrow$. For exact MDP $M = (Q', A', F', R')$, we have
$M \in M_\updownarrow$ if $Q = Q'$, $A = A'$, and for any action $a$ and
states $p, q$, $R'(p)$ is in the interval $R_\updownarrow(p)$ and
$F'_{p,q}(a)$ is in the interval $F_{\updownarrow p,q}(a)$. We rely on
context to distinguish between the tuple view of $M_\updownarrow$ and the
exact MDP set view of $M_\updownarrow$. In the definitions in this section,
the BMDP $M_\updownarrow$ is implicit.

An interval value function $V_\updownarrow$ is a mapping from states to
closed real intervals. We generally use such functions to indicate that the
given state's value falls within the selected interval. Interval value
functions can be specified for both exact MDPs and BMDPs. As in the case of
(exact) value functions, interval value functions are specified with
respect to a fixed policy. Note that in the case of BMDPs a state can have
a range of values depending on how the transition and reward parameters are
instantiated, hence the need for an interval value function.

For each of the interval valued functions $F_\updownarrow$,
$R_\updownarrow$, $V_\updownarrow$ we define two real valued functions
which take the same arguments and give the upper and lower interval bounds,
denoted $\overline{F}$, $\overline{R}$, $\overline{V}$, and
$\underline{F}$, $\underline{R}$, $\underline{V}$, respectively. So, for
example, at any state $q$ we have
$V_\updownarrow(q) = [\underline{V}(q), \overline{V}(q)]$.

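Definition 1. For any policy $\pi$ and state $q$, we define the interval
value $V_{\updownarrow\pi}(q)$ of $\pi$ at $q$ to be the interval

$$\left[ \min_{M \in M_\updownarrow} V_{M,\pi}(q),\ \max_{M \in M_\updownarrow} V_{M,\pi}(q) \right]$$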

In Section 5 we will give an iterative algorithm which we have proven to
converge to $V_{\updownarrow\pi}$. In preparation for that discussion we
now state that there is at least one specific MDP in $M_\updownarrow$ which
simultaneously achieves $\overline{V}_\pi(q)$ for all states $q$ (and
likewise a specific MDP achieving $\underline{V}_\pi(q)$ for all $q$).

Definition 2. For any policy $\pi$, an MDP in $M_\updownarrow$ is
$\pi$-maximizing if it is a possible value of
$\arg\max_{M \in M_\updownarrow} V_{M,\pi}$ and it is $\pi$-minimizing if
it is in $\arg\min_{M \in M_\updownarrow} V_{M,\pi}$.

Theorem 3. For any policy $\pi$, there exist $\pi$-maximizing and
$\pi$-minimizing MDPs in $M_\updownarrow$.

This theorem implies that $\underline{V}_\pi$ is equivalent to
$\min_{M \in M_\updownarrow} V_{M,\pi}$ where the minimization is done
relative to $\ge_{dom}$, and likewise for $\overline{V}_\pi$ using max. We
give an algorithm in Section 5 which converges to $\underline{V}_\pi$ by
also converging to a $\pi$-minimizing MDP in $M_\updownarrow$ (likewise for
$\overline{V}_\pi$).

We now consider how to define an optimal value function for a BMDP.
Consider the expression $\max_{\pi \in \Pi} V_{\updownarrow\pi}$. This
expression is ill-formed because we have not defined how to rank the
interval value functions $V_{\updownarrow\pi}$ in order to select a
maximum. We focus here on two different ways to order these value
functions, yielding two notions of optimal value function and optimal
policy. Other orderings may also yield interesting results.

First, we define two different orderings on closed real intervals:

$$[l_1, u_1] \le_{pes} [l_2, u_2] \iff l_1 < l_2, \text{ or } (l_1 = l_2 \text{ and } u_1 \le u_2)$$

$$[l_1, u_1] \le_{opt} [l_2, u_2] \iff u_1 < u_2, \text{ or } (u_1 = u_2 \text{ and } l_1 \le l_2)$$
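Read as lexicographic comparisons, the two interval orderings can be written directly as predicates on `(lower, upper)` pairs. The sketch below is illustrative; the function names `leq_pes`, `leq_opt` and `vf_leq` are hypothetical. The pessimistic ordering compares lower bounds first and breaks ties on upper bounds, and the optimistic ordering does the reverse.

```python
def leq_pes(i1, i2):
    """Pessimistic ordering: compare lower bounds, break ties on upper bounds."""
    (l1, u1), (l2, u2) = i1, i2
    return l1 < l2 or (l1 == l2 and u1 <= u2)


def leq_opt(i1, i2):
    """Optimistic ordering: compare upper bounds, break ties on lower bounds."""
    (l1, u1), (l2, u2) = i1, i2
    return u1 < u2 or (u1 == u2 and l1 <= l2)


def vf_leq(v1, v2, leq):
    """Pointwise extension to interval value functions (dicts over the same
    states): v1 <= v2 only when the relation holds at every state."""
    return all(leq(v1[q], v2[q]) for q in v1)


# The wide interval [1, 5] ranks below [2, 3] pessimistically,
# but above it optimistically.
wide, narrow = (1, 5), (2, 3)
```

Because the extension to value functions is pointwise, two interval value functions can be incomparable even though any two intervals are always comparable.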

We extend these orderings to partially order interval value functions by
relating two value functions $V_1 \le V_2$ only when $V_1(q) \le V_2(q)$
for every state $q$. We can now use either of these orderings to compute
$\max_{\pi \in \Pi} V_{\updownarrow\pi}$, yielding two definitions of
optimal value function and optimal policy. However, since the orderings are
partial (on value functions), we must still prove that the set of policies
contains a policy which achieves the desired maximum under each ordering
(i.e., a policy whose interval value function is ordered above that of
every other policy).

Definition 4. The optimistic optimal value function $V_{\updownarrow opt}$
and the pessimistic optimal value function $V_{\updownarrow pes}$ are given
by:

$V_{\updownarrow opt} = \max_{\pi \in \Pi} V_{\updownarrow\pi}$ using $\le_{opt}$ to order interval value functions
$V_{\updownarrow pes} = \max_{\pi \in \Pi} V_{\updownarrow\pi}$ using $\le_{pes}$ to order interval value functions

We say that any policy $\pi$ whose interval value function
$V_{\updownarrow\pi}$ is $\ge_{opt}$ ($\ge_{pes}$) the value functions
$V_{\updownarrow\pi'}$ of all other policies $\pi'$ is optimistically
(pessimistically) optimal.

Theorem 5. There exists at least one optimistically (pessimistically)
optimal policy, and therefore the definition of $V_{\updownarrow opt}$
($V_{\updownarrow pes}$) is well-formed.


The above two notions of optimal value can be understood in terms of a game
in which we choose a policy $\pi$ and then a second player chooses in which
MDP $M$ in $M_\updownarrow$ to evaluate the policy. The goal is to get the
highest^4 resulting value function $V_{M,\pi}$. The optimistic optimal
value function's upper bounds $\overline{V}_{opt}$ represent the best value
function we can obtain in this game if we assume the second player is
cooperating with us. The pessimistic optimal value function's lower bounds
$\underline{V}_{pes}$ represent the best we can do if we assume the second
player is our adversary, trying to minimize the resulting value function.

In the next section, we describe well-known iterative algorithms for
computing the exact MDP optimal value function $V^*$, and then in Section 5
we will describe similar iterative algorithms which compute the BMDP
variants $V_{\updownarrow opt}$ ($V_{\updownarrow pes}$).

4  Estimating  Traditional  Value  Functions 

In this section, we review the basics concerning dynamic programming
methods for computing value functions for fixed and optimal policies in
traditional MDPs. In the next section, we describe novel algorithms for
computing the interval analogs of these value functions for bounded
parameter MDPs.

We present results from the theory of exact MDPs which rely on the concept
of normed linear spaces. We define operators, $VI_\pi$ and $VI$, on the
space of value functions. We then use the Banach fixed-point theorem
(Theorem 6) to show that iterating these operators converges to unique
fixed-points, $V_\pi$ and $V^*$ respectively (Theorems 8 and 9).

Let $\mathcal{V}$ denote the set of value functions on $Q$. For each $v \in \mathcal{V}$, define the (sup)
norm of $v$ by

$$\|v\| = \max_{q \in Q} |v(q)|.$$

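We use the term convergence to mean convergence in the norm sense. The space
$\mathcal{V}$ together with $\|\cdot\|$ constitutes a complete normed linear space, or Banach space.
If $U$ is a Banach space, then an operator $T : U \to U$ is a contraction mapping if
there exists a $\lambda$, $0 \leq \lambda < 1$, such that $\|Tv - Tu\| \leq \lambda \|v - u\|$ for all $u$ and $v$ in $U$.
Define $VI : \mathcal{V} \to \mathcal{V}$ and, for each $\pi \in \Pi$, $VI_\pi : \mathcal{V} \to \mathcal{V}$ on each $p \in Q$ by

$$VI(v)(p) = \max_{\alpha \in A} \Big( R(p) + \gamma \sum_{q \in Q} F_{pq}(\alpha)\, v(q) \Big)$$

$$VI_\pi(v)(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, v(q)$$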

In cases where we need to make explicit the MDP from which the transition
function $F$ originates, we write $VI_{M,\pi}$ and $VI_M$ to denote the operators $VI_\pi$
and $VI$ as just defined, except that the transition function $F$ is $F^M$.

Using these operators, we can rewrite the expressions for $V^*$ and $V_\pi$ as

$$V^*(p) = VI(V^*)(p) \quad \text{and} \quad V_\pi(p) = VI_\pi(V_\pi)(p)$$

for all states $p \in Q$. This implies that $V^*$ and $V_\pi$ are fixed points of $VI$ and $VI_\pi$,
respectively. The following four theorems show that for each operator, iterating
the operator on an initial value estimate converges to these fixed points.

* Value functions are ranked by $\geq_{dom}$.

Theorem 6. For any Banach space $U$ and contraction mapping $T : U \to U$,
there exists a unique $v^*$ in $U$ such that $Tv^* = v^*$; and for arbitrary $v^0$ in $U$, the
sequence $\{v^n\}$ defined by $v^n = Tv^{n-1} = T^n v^0$ converges to $v^*$.

Theorem 7. $VI$ and $VI_\pi$ are contraction mappings.
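
The contraction property for $VI_\pi$ can be checked directly from the definitions. The following derivation is a sketch, assuming only that $0 \leq \gamma < 1$ and that each row $F_{p\cdot}(\pi(p))$ is a probability distribution over $Q$:

```latex
\begin{align*}
\|VI_\pi(v) - VI_\pi(u)\|
  &= \max_{p \in Q} \Big| \gamma \sum_{q \in Q} F_{pq}(\pi(p)) \big( v(q) - u(q) \big) \Big| \\
  &\leq \gamma \max_{p \in Q} \sum_{q \in Q} F_{pq}(\pi(p)) \, |v(q) - u(q)| \\
  &\leq \gamma \, \|v - u\| .
\end{align*}
```

So $VI_\pi$ is a contraction with $\lambda = \gamma$. The argument for $VI$ follows the same pattern, using $|\max_\alpha f(\alpha) - \max_\alpha g(\alpha)| \leq \max_\alpha |f(\alpha) - g(\alpha)|$ to handle the maximization over actions.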

Theorem  6  and  Theorem  7  together  prove  the  following  fundamental  results 
in  the  theory  of  MDPs. 

Theorem 8. There exists a unique $v^* \in \mathcal{V}$ satisfying $v^* = VI(v^*)$; furthermore,
$v^* = V^*$. Similarly, $V_\pi$ is the unique fixed point of $VI_\pi$.

Theorem 9. For arbitrary $v^0 \in \mathcal{V}$, the sequence $\{v^n\}$ defined by $v^n = VI(v^{n-1})
= VI^n(v^0)$ converges to $V^*$. Similarly, iterating $VI_\pi$ converges to $V_\pi$.

An important consequence of Theorem 9 is that it provides an algorithm for
finding $V^*$ and $V_\pi$. In particular, to find $V^*$, we can start from an arbitrary
initial value function $v^0$ in $\mathcal{V}$, and repeatedly apply the operator $VI$ to obtain
the sequence $\{v^n\}$. This algorithm is referred to as value iteration. Theorem 9
guarantees the convergence of value iteration to the optimal value function.
Similarly, we can specify an algorithm called policy evaluation which finds $V_\pi$ by
repeatedly applying $VI_\pi$ starting with an initial $v^0 \in \mathcal{V}$.
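
As an illustration of these two algorithms, the following is a minimal Python sketch. The two-state MDP, the names `backup`, `iterate`, `VI`, and `VI_pi`, and the stopping tolerance are all assumptions made for this example; they do not come from the paper.

```python
# Hypothetical two-state, two-action MDP used only for illustration.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = ["stay", "go"]
R = {0: 0.0, 1: 1.0}
# F[p][a][q] = probability of moving from state p to state q under action a.
F = {
    0: {"stay": {0: 1.0, 1: 0.0}, "go": {0: 0.2, 1: 0.8}},
    1: {"stay": {0: 0.0, 1: 1.0}, "go": {0: 0.8, 1: 0.2}},
}


def backup(v, p, a):
    """One-step lookahead: R(p) + gamma * sum_q F_pq(a) v(q)."""
    return R[p] + GAMMA * sum(F[p][a][q] * v[q] for q in STATES)


def VI(v):
    """The operator VI: maximize the backup over actions in every state."""
    return {p: max(backup(v, p, a) for a in ACTIONS) for p in STATES}


def VI_pi(v, pi):
    """The operator VI_pi: back up using the action chosen by policy pi."""
    return {p: backup(v, p, pi[p]) for p in STATES}


def iterate(op, v0, tol=1e-10, max_steps=100_000):
    """Apply op repeatedly until successive estimates differ by < tol (sup norm)."""
    v = dict(v0)
    for _ in range(max_steps):
        nxt = op(v)
        if max(abs(nxt[p] - v[p]) for p in STATES) < tol:
            return nxt
        v = nxt
    return v


zero = {p: 0.0 for p in STATES}
v_star = iterate(VI, zero)                       # value iteration
pi = {0: "go", 1: "stay"}
v_pi = iterate(lambda v: VI_pi(v, pi), zero)     # policy evaluation
```

In this toy model the policy `pi` happens to be optimal, so both iterations approach the same fixed point, as Theorem 8 would lead one to expect.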

The following theorem from [Littman et al., 1995] states a convergence rate of
value  iteration  and  policy  evaluation  which  can  be  derived  using  bounds  on  the 
precision  needed  to  represent  solutions  to  a  linear  program  of  limited  precision 
(each  algorithm  can  be  viewed  as  solving  a  linear  program). 

Theorem 10. For fixed $\gamma$, value iteration and policy evaluation converge to the
optimal value function in a number of steps polynomial in the number of states,
the number of actions, and the number of bits used to represent the MDP
parameters.


5  Estimating  Interval  Value  Functions 

In this section, we describe dynamic programming algorithms which operate
on bounded parameter MDPs. We first define the interval equivalent of policy
evaluation $\widehat{IVI}_\pi$ which computes $\hat{V}_\pi$, and then define the variants $\widehat{IVI}_{opt}$ and
$\widehat{IVI}_{pes}$ which compute the optimistic and pessimistic optimal value functions.


5.1  Interval  Policy  Evaluation 

In direct analogy to the definition of $VI_\pi$ in Section 4, we define a function $\widehat{IVI}_\pi$
(for interval value iteration) which maps interval value functions to other interval
value functions. We have proven that iterating $\widehat{IVI}_\pi$ on any initial interval value
function produces a sequence of interval value functions which converges to $\hat{V}_\pi$
in a polynomial number of steps, given a fixed discount factor $\gamma$.

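$\widehat{IVI}_\pi(\hat{V})$ is an interval value function, defined for each state $p$ as follows:

$$\widehat{IVI}_\pi(\hat{V})(p) = \Big[ \min_{M \in \mathcal{M}} VI_{M,\pi}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,\pi}(\overline{V})(p) \Big]$$

We define $\underline{IVI}_\pi$ and $\overline{IVI}_\pi$ to be the corresponding mappings from value
functions to value functions (note that for input $\hat{V}$, $\underline{IVI}_\pi$ does not depend on $\overline{V}$ and
so can be viewed as a function from $\mathcal{V}$ to $\mathcal{V}$; likewise for $\overline{IVI}_\pi$ and $\underline{V}$).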

The algorithm to compute $\widehat{IVI}_\pi$ is very similar to the standard MDP
computation of $VI$, except that we must now be able to select an MDP $M$ from
the family $\mathcal{M}$ which minimizes (maximizes) the value attained. We select such
an MDP by selecting a function $F$ within the bounds specified by $\hat{F}$ to
minimize (maximize) the value; each possible way of selecting $F$ corresponds to one
MDP in $\mathcal{M}$. We can select the values of $F_{pq}(\alpha)$ independently for each $\alpha$ and
$p$, but the values selected for different states $q$ (for fixed $\alpha$ and $p$) interact: they
must sum up to one. We now show how to determine, for fixed $\alpha$ and $p$, the
value of $F_{pq}(\alpha)$ for each state $q$ so as to minimize (maximize) the expression
$\sum_{q \in Q} F_{pq}(\alpha)\,\underline{V}(q)$ ($\sum_{q \in Q} F_{pq}(\alpha)\,\overline{V}(q)$). This step constitutes the heart of the
IVI algorithm and the only significant way the algorithm differs from standard
value iteration.

The idea is to sort the possible destination states $q$ into increasing (decreasing)
order according to their $\underline{V}$ ($\overline{V}$) value, and then choose the transition
probabilities within the intervals specified by $\hat{F}$ so as to send as much probability
mass as possible to the states early in the ordering. Let $q_1, q_2, \ldots, q_k$ be such an
ordering of $Q$, so that, in the minimizing case, for all $i$ and $j$, if $1 \leq i \leq j \leq k$ then
$\underline{V}(q_i) \leq \underline{V}(q_j)$ (increasing order).

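Let $r$ be the index $1 \leq r \leq k$ which maximizes the following expression
without letting it exceed 1:

$$\sum_{i=1}^{r-1} \overline{F}_{p,q_i}(\alpha) + \sum_{i=r}^{k} \underline{F}_{p,q_i}(\alpha)$$

$r$ is the index into the sequence $q_i$ such that below index $r$ we can assign the
upper bound, and above index $r$ we can assign the lower bound, with the rest of
the probability mass from $p$ under $\alpha$ being assigned to $q_r$. Formally, we choose
$F_{pq}(\alpha)$ for all $q \in Q$ as follows:

$$F_{p,q_i}(\alpha) = \begin{cases} \overline{F}_{p,q_i}(\alpha) & \text{if } i < r \\ \underline{F}_{p,q_i}(\alpha) & \text{if } i > r \end{cases}$$

$$F_{p,q_r}(\alpha) = 1 - \sum_{i=1,\ i \neq r}^{k} F_{p,q_i}(\alpha)$$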


Fig. 2. An illustration of the basic dynamic programming step in computing an
approximate value function for a fixed policy and bounded parameter MDP. The lighter
shaded portions of each arc represent the required lower bound transition
probability and the darker shaded portions represent the fraction of the remaining transition
probability to the upper bound assigned to the arc by $F$.


Figure 2 illustrates the basic iterative step in the above algorithm, for the
maximizing case. The states $q_i$ are ordered according to the value estimates in
$\overline{V}$. The transitions from a state $p$ to states $q_i$ are defined by the function $F$ such
that each transition is equal to its lower bound plus some fraction of the leftover
probability mass.
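
The selection step can be sketched in Python as follows. This is an equivalent greedy formulation rather than a literal transcription of the index-$r$ construction: every state starts at its lower bound, and the leftover mass is handed out in sorted order until it runs out. The function name `extreme_transition`, the dictionary layout, and the example numbers are assumptions made for this illustration, and the bounds are assumed well-formed (lower bounds sum to at most 1, upper bounds to at least 1).

```python
def extreme_transition(lower, upper, values, minimize=True):
    """Choose probabilities prob[q] in [lower[q], upper[q]] summing to 1 that
    minimize (or maximize) sum_q prob[q] * values[q].

    lower, upper, values: dicts keyed by destination state q (hypothetical layout).
    """
    # Increasing value order when minimizing, decreasing when maximizing.
    order = sorted(lower, key=lambda q: values[q], reverse=not minimize)
    prob = dict(lower)                    # every state gets its lower bound
    slack = 1.0 - sum(lower.values())     # probability mass still unassigned
    for q in order:
        # States early in the ordering are raised toward their upper bound;
        # the state where the slack runs out plays the role of q_r.
        give = min(upper[q] - lower[q], slack)
        prob[q] += give
        slack -= give
    return prob


# Hypothetical bounds for one (state, action) pair and current value estimates.
lower = {"a": 0.1, "b": 0.2, "c": 0.3}
upper = {"a": 0.3, "b": 0.6, "c": 0.7}
values = {"a": 1.0, "b": 2.0, "c": 3.0}

f_min = extreme_transition(lower, upper, values, minimize=True)
f_max = extreme_transition(lower, upper, values, minimize=False)
```

With these numbers the minimizing choice raises `a` to its upper bound, gives the remaining mass to `b`, and leaves `c` at its lower bound; the maximizing choice sends all the leftover mass to `c`.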

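Techniques similar to those in Section 4 can be used to prove that iterating
$\underline{IVI}_\pi$ ($\overline{IVI}_\pi$) converges to $\underline{V}_\pi$ ($\overline{V}_\pi$). The key theorems, stated below, assert
first that $\underline{IVI}_\pi$ is a contraction mapping, and second that $\underline{V}_\pi$ is a fixed point of
$\underline{IVI}_\pi$, and are easily proven*.

Theorem 11. For any policy $\pi$, $\underline{IVI}_\pi$ and $\overline{IVI}_\pi$ are contraction mappings.

Theorem 12. For any policy $\pi$, $\underline{V}_\pi$ is a fixed point of $\underline{IVI}_\pi$ and $\overline{V}_\pi$ of $\overline{IVI}_\pi$.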

These theorems, together with Theorem 6 (the Banach fixed-point theorem), imply
that iterating $\widehat{IVI}_\pi$ on any initial interval value function converges to $\hat{V}_\pi$,
regardless of the starting point.

Theorem 13. For fixed $\gamma$, interval policy evaluation converges to the desired
interval value function in a number of steps polynomial in the number of states, the
number of actions, and the number of bits used to represent the MDP parameters.
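
Interval policy evaluation as a whole can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the two-state BMDP, the use of point-valued rewards, the names `F_LO`, `F_HI`, `extreme_transition`, and `ivi_pi`, and the stopping tolerance are all hypothetical. The transition bounds are those under one fixed policy, so actions do not appear.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
R = {"s0": 0.0, "s1": 1.0}          # point rewards (an assumption of this sketch)
# Lower and upper transition bounds under the fixed policy: F_LO[p][q], F_HI[p][q].
F_LO = {"s0": {"s0": 0.1, "s1": 0.7}, "s1": {"s0": 0.0, "s1": 0.8}}
F_HI = {"s0": {"s0": 0.3, "s1": 0.9}, "s1": {"s0": 0.2, "s1": 1.0}}


def extreme_transition(lower, upper, values, minimize):
    """Distribution within the bounds that minimizes/maximizes expected value."""
    order = sorted(lower, key=lambda q: values[q], reverse=not minimize)
    prob = dict(lower)
    slack = 1.0 - sum(lower.values())
    for q in order:
        give = min(upper[q] - lower[q], slack)
        prob[q] += give
        slack -= give
    return prob


def ivi_pi(v_lo, v_hi):
    """One application of the interval operator for the fixed policy."""
    new_lo, new_hi = {}, {}
    for p in STATES:
        f_min = extreme_transition(F_LO[p], F_HI[p], v_lo, minimize=True)
        f_max = extreme_transition(F_LO[p], F_HI[p], v_hi, minimize=False)
        new_lo[p] = R[p] + GAMMA * sum(f_min[q] * v_lo[q] for q in STATES)
        new_hi[p] = R[p] + GAMMA * sum(f_max[q] * v_hi[q] for q in STATES)
    return new_lo, new_hi


def interval_policy_evaluation(tol=1e-10, max_steps=100_000):
    """Iterate ivi_pi from the zero interval function until it stabilizes."""
    v_lo = {p: 0.0 for p in STATES}
    v_hi = {p: 0.0 for p in STATES}
    for _ in range(max_steps):
        n_lo, n_hi = ivi_pi(v_lo, v_hi)
        change = max(max(abs(n_lo[p] - v_lo[p]), abs(n_hi[p] - v_hi[p]))
                     for p in STATES)
        v_lo, v_hi = n_lo, n_hi
        if change < tol:
            break
    return v_lo, v_hi


v_lo, v_hi = interval_policy_evaluation()
```

The lower and upper bounds are updated independently here, which mirrors the observation that the lower-bound operator does not depend on the upper bounds and vice versa.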


* The min over members of $\mathcal{M}$ is dealt with using a technique similar to that used to
handle the max over actions in the same proof for $V^*$.


5.2  Interval  Value  Iteration 

As in the case of $VI_\pi$ and $VI$, it is straightforward to modify $\widehat{IVI}_\pi$ so that it
computes optimal policy value intervals by adding a maximization step over the
different action choices in each state. However, unlike standard value iteration,
the quantities being compared in the maximization step are closed real intervals,
so the resulting algorithm varies according to how we choose to compare real
intervals. We define two variations of interval value iteration; other variations
are possible.

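$$\widehat{IVI}_{opt}(\hat{V})(p) = \max_{\alpha \in A,\ \leq_{opt}} \Big[ \min_{M \in \mathcal{M}} VI_{M,\alpha}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,\alpha}(\overline{V})(p) \Big]$$

$$\widehat{IVI}_{pes}(\hat{V})(p) = \max_{\alpha \in A,\ \leq_{pes}} \Big[ \min_{M \in \mathcal{M}} VI_{M,\alpha}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,\alpha}(\overline{V})(p) \Big]$$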

The added maximization step introduces no new difficulties in implementing
the algorithm. We discuss convergence for $\widehat{IVI}_{opt}$; the convergence results for
$\widehat{IVI}_{pes}$ are similar. We write $\overline{IVI}_{opt}$ for the upper bound returned by $\widehat{IVI}_{opt}$,
and we consider $\overline{IVI}_{opt}$ a function from $\mathcal{V}$ to $\mathcal{V}$ because $\overline{IVI}_{opt}(\hat{V})$ depends
only on $\overline{V}$. $\overline{IVI}_{opt}$ can be easily shown to be a contraction mapping, and it
can be shown that $\overline{V}_{opt}$ is a fixed point of $\overline{IVI}_{opt}$. It then follows that $\overline{IVI}_{opt}$
converges to $\overline{V}_{opt}$ in polynomially many steps. The analogous results for $\underline{IVI}_{opt}$
are somewhat more problematic. Because the action selection is done according
to $\leq_{opt}$, which focuses primarily on the interval upper bounds, $\underline{IVI}_{opt}$ is not
properly a mapping from $\mathcal{V}$ to $\mathcal{V}$, as $\underline{IVI}_{opt}(\hat{V})$ depends on both $\underline{V}$ and $\overline{V}$.

However, for any particular value function $V$ and interval value function $\hat{V}$ such
that $\overline{V} = V$, we can write $\underline{IVI}_{opt,V}$ for the mapping from $\mathcal{V}$ to $\mathcal{V}$ which carries $\underline{V}$
to $\underline{IVI}_{opt}(\hat{V})$. We can then show that for each $V$, $\underline{IVI}_{opt,V}$ converges as desired.
The algorithm must then iterate $\overline{IVI}_{opt}$ to convergence to some upper bound $\overline{V}$,
and then iterate $\underline{IVI}_{opt,\overline{V}}$ to converge to the lower bounds $\underline{V}$, each convergence
within polynomial time.

Theorem 14. A. $\overline{IVI}_{opt}$ and $\underline{IVI}_{pes}$ are contraction mappings.

B. For any value function $V$, $\underline{IVI}_{opt,V}$ and $\overline{IVI}_{pes,V}$ are contraction mappings.

Theorem 15. $\hat{V}_{opt}$ is a fixed point of $\widehat{IVI}_{opt}$, and $\hat{V}_{pes}$ of $\widehat{IVI}_{pes}$.

Theorem 16. For fixed $\gamma$, iteration of $\widehat{IVI}_{opt}$ converges to $\hat{V}_{opt}$, and iteration
of $\widehat{IVI}_{pes}$ converges to $\hat{V}_{pes}$, in polynomially many iterations in the problem size
(including the number of bits used in specifying the parameters).
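
The two-phase procedure for the optimistic case (first converge the upper bounds, then converge the lower bounds with the upper bounds held fixed) might be sketched as follows. Everything concrete here is an assumption for illustration: the two-state, two-action BMDP, the point-valued rewards, the names `BOUNDS`, `extreme_backup`, `fixed_point`, and `ivi_opt`, and in particular the treatment of $\leq_{opt}$ as a lexicographic comparison (upper bound first, lower bound as tie-breaker), implemented with Python tuple ordering.

```python
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["a", "b"]
R = {"s0": 0.0, "s1": 1.0}
# BOUNDS[p][action][q] = (lower, upper) bounds on the transition probability.
BOUNDS = {
    "s0": {"a": {"s0": (0.1, 0.3), "s1": (0.7, 0.9)},
           "b": {"s0": (0.5, 0.6), "s1": (0.4, 0.5)}},
    "s1": {"a": {"s0": (0.0, 0.2), "s1": (0.8, 1.0)},
           "b": {"s0": (0.4, 0.6), "s1": (0.4, 0.6)}},
}


def extreme_backup(p, a, v, minimize):
    """Min (or max) over the MDP family of the one-step backup of v at p under a."""
    bounds = BOUNDS[p][a]
    order = sorted(STATES, key=lambda q: v[q], reverse=not minimize)
    prob = {q: bounds[q][0] for q in STATES}
    slack = 1.0 - sum(prob.values())
    for q in order:
        give = min(bounds[q][1] - bounds[q][0], slack)
        prob[q] += give
        slack -= give
    return R[p] + GAMMA * sum(prob[q] * v[q] for q in STATES)


def fixed_point(op, tol=1e-10, max_steps=100_000):
    """Iterate op from the zero value function until it stabilizes (sup norm)."""
    v = {p: 0.0 for p in STATES}
    for _ in range(max_steps):
        nxt = op(v)
        if max(abs(nxt[p] - v[p]) for p in STATES) < tol:
            return nxt
        v = nxt
    return v


def ivi_opt():
    # Phase 1: the upper bounds depend only on upper bounds.
    v_hi = fixed_point(lambda v: {
        p: max(extreme_backup(p, a, v, minimize=False) for a in ACTIONS)
        for p in STATES})

    # Phase 2: with v_hi fixed, keep the lower bound of the action whose
    # (upper, lower) interval is greatest under the assumed optimistic order.
    def lower_op(v_lo):
        out = {}
        for p in STATES:
            intervals = [(extreme_backup(p, a, v_hi, minimize=False),
                          extreme_backup(p, a, v_lo, minimize=True))
                         for a in ACTIONS]
            out[p] = max(intervals)[1]
        return out

    return fixed_point(lower_op), v_hi


v_lo, v_hi = ivi_opt()
```

The pessimistic variant would swap the roles of the two bounds: converge the lower bounds first, then select actions by comparing (lower, upper) pairs.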

6  Policy  Selection,  Sensitivity  Analysis,  and  Aggregation 

In this section, we consider some basic issues concerning the use and
interpretation of bounded parameter MDPs. We begin by reemphasizing some ideas
introduced earlier regarding the selection of policies.




To  begin  with,  it  is  important  that  we  are  clear  on  the  status  of  the  bounds 
in  a  bounded  parameter  MDP.  A  bounded  parameter  MDP  specifies  upper  and 
lower bounds on individual parameters; the assumption is that we have no
additional information regarding individual exact MDPs whose parameters fall within
those bounds. In particular, we have no prior over the exact MDPs in the family
of  MDPs  defined  by  a  bounded  parameter  MDP. 

Policy  selection  Despite  the  lack  of  information  regarding  any  particular  MDP, 
we  may  have  to  choose  a  policy.  In  such  a  situation,  it  is  natural  to  consider 
that the actual MDP, i.e., the one in which we will ultimately have to carry out
some  policy,  is  decided  by  some  outside  process.  That  process  might  choose  so 
as  to  help  or  hinder  us,  or  it  might  be  entirely  indifferent.  To  minimize  the  risk 
of  performing  poorly,  it  is  reasonable  to  think  in  adversarial  terms;  we  select 
the  policy  which  will  perform  as  well  as  possible  assuming  that  the  adversary 
chooses  so  that  we  perform  as  poorly  as  possible. 

These  choices  correspond  to  optimistic  and  pessimistic  optimal  policies.  We 
have  discussed  in  the  last  section  how  to  compute  interval  value  functions  for 
such  policies — such  value  functions  can  then  be  used  in  a  straightforward  manner 
to  extract  policies  which  achieve  those  values. 

There  are  other  possible  choices,  corresponding  in  general  to  other  means  of 
totally  ordering  real  closed  intervals.  We  might  for  instance  consider  a  policy 
whose  average  performance  over  all  MDPs  in  the  family  is  as  good  as  or  better 
than  the  average  performance  of  any  other  policy.  This  notion  of  average  is 
potentially  problematic,  however,  as  it  essentially  assumes  a  uniform  prior  over 
exact  MDPs  and,  as  stated  earlier,  the  bounds  do  not  imply  any  particular  prior. 

Sensitivity  analysis  There  are  other  ways  in  which  bounded  parameter  MDPs 
might  be  useful  in  planning  under  uncertainty.  For  example,  we  might  assume 
that  we  begin  with  a  particular  exact  MDP,  say,  the  MDP  with  parameters  whose 
values  reflect  the  best  guess  according  to  a  given  domain  expert.  If  we  were  to 
compute  the  optimal  policy  for  this  exact  MDP,  we  might  wonder  about  the 
degree  to  which  this  policy  is  sensitive  to  the  numbers  supplied  by  the  expert. 

To  explore  this  possible  sensitivity  to  the  parameters,  we  might  assess  the 
policy  by  perturbing  the  parameters  and  evaluating  the  policy  with  respect  to 
the  perturbed  MDP.  Alternatively,  we  could  use  BMDPs  to  perform  this  sort  of 
sensitivity  analysis  on  a  whole  family  of  MDPs  by  converting  the  point  estimates 
for  the  parameters  to  confidence  intervals  and  then  computing  bounds  on  the 
value  function  for  the  fixed  policy  via  interval  policy  evaluation. 
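
One simple way to convert point estimates into intervals is sketched below. It uses a fixed half-width `eps` as a stand-in for a real confidence interval, which is purely an assumption of this example, as are the function name `widen` and the sample numbers. The resulting bounds could then be passed to an interval policy evaluation routine.

```python
def widen(point, eps):
    """Turn point transition estimates into [lower, upper] bounds.

    point: dict mapping destination state -> estimated probability.
    eps:   hypothetical half-width of the interval around each estimate.
    Bounds are clipped so that they remain valid probabilities.
    """
    lower = {q: max(0.0, pr - eps) for q, pr in point.items()}
    upper = {q: min(1.0, pr + eps) for q, pr in point.items()}
    return lower, upper


# Hypothetical expert estimates for the transitions out of one state.
expert_row = {"x": 0.05, "y": 0.95}
lo, hi = widen(expert_row, 0.1)
```

Because the original estimates sum to one, the widened lower bounds sum to at most one and the upper bounds to at least one, so the family of exact MDPs they define is non-empty and contains the expert's model.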

Aggregation  Another  use  of  BMDPs  involves  a  different  interpretation  altogether. 
Instead of viewing the states of the bounded parameter MDP as individual
primitive states, we view each state of the BMDP as representing a set or aggregate
of states of some other, larger MDP.

In this interpretation, states are aggregated together because they behave
approximately the same with respect to possible state transitions. A little more
precisely, suppose that the set of states of the BMDP $\mathcal{M}$ corresponds to the set
of blocks $\{B_1, \ldots, B_n\}$ such that the $\{B_i\}$ constitute a partition of the states of
another MDP with a much larger state space.

Now we interpret the bounds as follows: for any two blocks $B_i$ and $B_j$, let
$\hat{F}_{B_i B_j}(\alpha)$ represent the interval value for the transition from $B_i$ to $B_j$ on action $\alpha$,
defined as follows:

$$\hat{F}_{B_i B_j}(\alpha) = \Big[ \min_{p \in B_i} \sum_{q \in B_j} F_{pq}(\alpha),\ \max_{p \in B_i} \sum_{q \in B_j} F_{pq}(\alpha) \Big]$$

Intuitively,  this  means  that  all  states  in  a  block  behave  approximately  the  same 
(assuming  the  lower  and  upper  bounds  are  close  to  each  other)  in  terms  of 
transitions  to  other  blocks  even  though  they  may  differ  widely  with  regard  to 
transitions  to  individual  states. 
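
The block-to-block interval can be computed directly when the base MDP is small enough to enumerate. The sketch below assumes a hypothetical four-state base MDP with a single action and a sparse dictionary layout `F[p][action][q]`; the function name `block_interval` is likewise an assumption of this example.

```python
def block_interval(F, block_i, block_j, action):
    """Interval [min, max] over p in block_i of the total probability of
    moving from p into block_j under the given action."""
    masses = [sum(F[p][action].get(q, 0.0) for q in block_j) for p in block_i]
    return min(masses), max(masses)


# Hypothetical base MDP: F[p][action][q], with missing entries meaning probability 0.
F = {
    0: {"a": {0: 0.5, 2: 0.25, 3: 0.25}},
    1: {"a": {1: 0.25, 2: 0.5, 3: 0.25}},
    2: {"a": {2: 1.0}},
    3: {"a": {0: 0.25, 3: 0.75}},
}
B1, B2 = [0, 1], [2, 3]     # a partition of the base states into two blocks

b1_to_b2 = block_interval(F, B1, B2, "a")
b1_to_b1 = block_interval(F, B1, B1, "a")
b2_to_b1 = block_interval(F, B2, B1, "a")
```

Here states 0 and 1 differ in where exactly they go inside block `B2`, but the total mass they send to `B2` lies in a fairly narrow interval, which is the sense in which the block behaves approximately uniformly.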

In Dean et al. [1997] we discuss methods for using an implicit representation
of an exact MDP with a large number of states to construct an explicit BMDP
with  a  possibly  much  smaller  number  of  states  based  on  an  aggregation  method. 
We  then  show  that  policies  computed  for  this  BMDP  can  be  extended  to  the 
original  large  implicitly  described  MDP.  Note  that  the  original  implicit  MDP 
is  not  even  a  member  of  the  family  of  MDPs  for  the  reduced  BMDP  (it  has  a 
different  state  space,  for  instance).  Nevertheless,  it  is  a  theorem  that  the  policies 
and  value  bounds  of  the  BMDP  can  be  soundly  applied  in  the  original  MDP 
(using  the  aggregation  mapping  to  connect  the  state  spaces). 

7  Related  Work  and  Conclusions 

Our definition for bounded parameter MDPs is related to a number of other
ideas appearing in the literature on Markov decision processes; in the following,
we mention just a few such ideas. First, BMDPs specialize the MDPs with
imprecisely known parameters (MDPIPs) described and analyzed in the
operations research literature [White and Eldeib, 1994, White and Eldeib, 1986,
Satia and Lave, 1973]. The more general MDPIPs described in these papers
require more general and expensive algorithms for solution. For example, [White
and Eldeib, 1994] allows an arbitrary linear program to define the bounds on the
transition probabilities (and allows no imprecision in the reward parameters);
as a result, the solution technique presented appeals to linear programming at
each iteration of the solution algorithm rather than exploiting the specific structure
available in a BMDP. [Satia and Lave, 1973] mention the restriction to BMDPs
but give no special algorithms to exploit this restriction. Their general MDPIP
algorithm is very different from our algorithm and involves two nested phases
of policy iteration: the outer phase selecting a traditional policy and the inner
phase selecting a “policy” for “nature”, i.e., a choice of the transition parameters
to minimize or maximize value (depending on whether optimistic or pessimistic
assumptions prevail). Our work, while originally developed independently of the
MDPIP literature, follows similar lines to [Satia and Lave, 1973] in defining
optimistic and pessimistic optimal policies.

Bertsekas and Castañon [1989] use the notion of aggregated Markov chains
and consider grouping together states with approximately the same residuals.
Methods for bounding value functions are frequently used in approximate
algorithms for solving MDPs; Lovejoy [1991] describes their use in solving partially
observable MDPs. Puterman [1994] provides an excellent introduction to Markov
decision processes and techniques involving bounding value functions.

Boutilier and Dearden [1994] and Boutilier et al. [1995b] describe methods for
solving  implicitly  described  MDPs  and  Dean  and  Givan  [1997]  reinterpret  this 
work  in  terms  of  computing  explicitly  described  MDPs  with  aggregate  states. 

Bounded parameter MDPs allow us to represent uncertainty about or
variation in the parameters of a Markov decision process. Interval value functions
capture the resulting variation in policy values. In this paper, we have defined
both  bounded  parameter  MDP  and  interval  value  function,  and  given  algorithms 
for computing interval value functions, and selecting and evaluating policies.

Abstract

In this paper, we introduce the notion of a bounded-parameter Markov decision process (BMDP)
as  a  generalization  of  the  familiar  exact  MDP.  A  bounded-parameter  MDP  is  a  set  of  exact  MDPs 
specified  by  giving  upper  and  lower  bounds  on  transition  probabilities  and  rewards  (all  the  MDPs 
in  the  set  share  the  same  state  and  action  space).  BMDPs  form  an  efficiently  solvable  special  case 
of  the  already  known  class  of  MDPs  with  imprecise  parameters  (MDPIPs).  Bounded-parameter 
MDPs  can  be  used  to  represent  variation  or  uncertainty  concerning  the  parameters  of  sequential 
decision  problems  in  cases  where  no  prior  probabilities  on  the  parameter  values  are  available. 
Bounded-parameter  MDPs  can  also  be  used  in  aggregation  schemes  to  represent  the  variation  in 
the  transition  probabilities  for  different  base  states  aggregated  together  in  the  same  aggregate 
state. 

We  introduce  interval  value  functions  as  a  natural  extension  of  traditional  value  functions.  An 
interval  value  function  assigns  a  closed  real  interval  to  each  state,  representing  the  assertion  that 
the  value  of  that  state  falls  within  that  interval.  An  interval  value  function  can  be  used  to  bound 
the performance of a policy over the set of exact MDPs associated with a given
bounded-parameter MDP. We describe an iterative dynamic programming algorithm called interval policy
evaluation that computes an interval value function for a given BMDP and specified policy. Interval
policy evaluation on a policy $\pi$ computes the most restrictive interval value function that is
sound, i.e., that bounds the value function for $\pi$ in every exact MDP in the set defined by the
bounded-parameter MDP. We define optimistic and pessimistic criteria for optimality, and
provide a variant of value iteration [1] that we call interval value iteration that computes policies for
a BMDP that are optimal with respect to these criteria. We show that each algorithm we present
converges to the desired values in a polynomial number of iterations given a fixed discount
factor.

Keywords: Decision-theoretic planning, Planning under uncertainty, Approximate planning,
Markov  decision  processes. 

1.  Introduction 

The  theory  of  Markov  decision  processes  (MDPs)  [11][14][2][10][1]  provides  the 
semantic  foundations  for  a  wide  range  of  problems  involving  planning  under 
uncertainty  [5] [7].  Most  work  in  the  planning  subarea  of  artificial  intelligence 
addresses  problems  that  can  be  formalized  using  MDP  models  —  however,  it  is 
often the case that such models are exponentially larger than the original
“intensional” problem representation used in AI work. This paper generalizes the theory
of MDPs in a manner that is useful for more compactly representing AI problems
as MDPs via state-space aggregation, as we discuss below.

In  this  paper,  we  introduce  a  generalization  of  Markov  decision  processes 
called  bounded-parameter  Markov  decision  processes  (BMDPs)  that  allows  us  to 
model uncertainty about the parameters that comprise an MDP. Instead of
encoding a parameter such as the probability of making a transition from one state to
another  as  a  single  number,  we  specify  a  range  of  possible  values  for  the  parameter 
as  a  closed  interval  of  the  real  numbers. 

A BMDP can be thought of as a family of traditional (exact) MDPs, i.e., the set
of all MDPs whose parameters fall within the specified ranges. From this
perspective, we may have no justification for committing to a particular MDP in this
family, and wish to analyze the consequences of this lack of commitment. Another
interpretation for a BMDP is that the states of the BMDP actually represent sets
(aggregates) of more primitive states that we choose to group together. The
intervals here represent the ranges of the parameters over the primitive states belonging
to the aggregates. While any policy on the original (primitive) states induces a
stationary distribution over those states that can be used to give prior probabilities to
the different transition probabilities in the intervals, we may be unable to compute
these prior probabilities; the original reason for aggregating the states is
typically to avoid such expensive computation over the original large state space.

Aggregation  of  states  in  very  large  state  spaces  was  our  original  motivation  for 
developing  BMDPs.  Substantial  effort  has  been  devoted  in  recent  years  within  the 
AI  community  [9]  [6]  [8]  to  the  problem  of  representing  and  reasoning  with  MDP 
problems where the state space is not explicitly listed but rather implicitly
specified with a factored representation. In such problems, an explicit listing of the
possible system states is exponentially longer than the more natural implicit problem
description, and such an explicit list is often intractable to work with. Most
planning problems of interest to AI researchers fit this description in that they are only
representable in reasonable space using implicit representations. Recent work in
applying MDPs to such problems (e.g., [9], [6], and [8]) has considered state-space
aggregation  techniques  as  a  means  of  dealing  with  this  problem;  rather  than  work 
with  the  possible  system  states  explicitly,  aggregation  techniques  work  with  blocks 
of  similar  or  identically-behaving  states.  When  aggregating  states  that  have  similar 
but not identical behavior, the question immediately arises of what transition
probability holds between the aggregates: this probability will depend on which
underlying state is in control, but this choice of underlying state is not modelled in the
aggregate  model.  This  work  can  be  viewed  as  providing  a  means  of  addressing  this 
problem  by  allowing  intervals  rather  than  point  values  for  the  aggregate  transition 
probabilities:  the  interval  can  be  chosen  to  include  the  true  value  for  each  of  the 
underlying  states  present  in  the  aggregates  involved.  It  should  be  noted  that  under 
these circumstances, deriving a prior probability distribution over the true
parameter values is often as expensive as simply avoiding the aggregation altogether and
would defeat the purpose entirely. Moreover, assuming any particular probability
distribution could produce arbitrarily inaccurate results. As a result, this work
considers parameters falling into intervals with no prior probability distribution
specified over the possible parameter values in the intervals, and seeks to put bounds on
how  badly  or  how  well  particular  plans  will  perform  in  such  a  context,  as  well  as  to 
provide  means  to  find  optimal  plans  under  optimistic  or  pessimistic  assumptions 
about  the  true  distribution  over  parameter  values.  In  Section  6,  we  discuss  the 
application of our BMDP approach to state-space aggregation problems more
formally. Also, in a related paper, we have shown how BMDPs can be used as part of
a state-space aggregation strategy for efficiently approximating the solution of
MDPs  with  very  large  state  spaces  and  dynamics  compactly  encoded  in  a  factored 
(or  implicit)  representation  [10]. 

We also discuss later in this paper the potential use of BMDP methods to
evaluate the sensitivity of the optimal policy in an exact MDP to small variations in the
parameter  values  defining  the  MDP  —  using  BMDP  policy  selection  algorithms  on 
a  BMDP  whose  parameter  intervals  represent  small  variations  (perhaps  confidence 
intervals)  around  the  exact  MDP  parameter  values,  the  best  and  worst  variation  in 
policy  value  achieved  can  be  measured. 

In this paper we introduce and discuss BMDPs, the BMDP analog of value
functions, called interval value functions, and policy selection and evaluation
methods for BMDPs. We provide BMDP analogs of the standard (exact) MDP
algorithms for computing the value function for a fixed policy (plan) and (more
generally) for computing optimal value functions over all policies, called interval
policy evaluation and interval value iteration (IVI) respectively. We define the
desired  output  values  for  these  algorithms  and  prove  that  the  algorithms  converge 
to  these  desired  values  in  polynomial  time,  for  a  fixed  discount  factor.  Finally,  we 
consider  two  different  notions  of  optimal  policy  for  a  BMDP,  and  show  how  IVI 
can  be  applied  to  extract  the  optimal  policy  for  each  notion.  The  first  notion  of 
optimality  states  that  the  desired  policy  must  perform  better  than  any  other  under 
the  assumption  that  an  adversary  selects  the  model  parameters.  The  second  notion 
requires  the  best  possible  performance  when  a  friendly  choice  of  model  parameters 
is  assumed. 

Our  interval  policy  evaluation  and  interval  value  iteration  algorithms  rely  on 
iterative  convergence  to  the  desired  values,  and  are  generalizations  of  the  standard 
MDP  algorithms  successive  approximation  and  value  iteration,  respectively.  We 
believe  it  is  also  possible  to  design  an  interval-valued  variant  of  the  standard  MDP 
algorithm  policy  iteration,  but  we  have  not  done  so  at  this  writing  —  however,  it 
should  be  clear  that  our  successive  approximation  algorithm  for  evaluating  policies 
in  the  BMDP  setting  provides  an  essential  basic  building  block  for  constructing  a 
policy  iteration  method;  all  that  need  be  added  is  a  means  for  selecting  a  new 
action  at  each  state  based  on  the  interval  value  function  of  the  preceding  policy 
(and  a  possibly  difficult  corresponding  analysis  of  the  properties  of  the  algorithm). 
We note that there is no consensus in the decision-theoretic planning and learning
and operations-research communities as to whether value iteration, policy
iteration, or even standard linear programming is generally the best approach to solving
MDP  problems:  each  technique  appears  to  have  its  strengths  and  weaknesses. 

BMDPs  are  an  efficiently  solvable  specialization  of  the  already  known  class  of 
Markov  Decision  Processes  with  Imprecisely  Known  Transition  Probabilities 
(MDPIPs)  [15][17][18].  In  the  related  work  section  we  discuss  in  more  detail  how 
BMDPs  relate  to  MDPIPs. 

Here  is  a  high-level  overview  of  how  conceptual,  theoretical,  algorithmic,  and 
experimental  treatments  are  woven  together  in  the  remainder  of  the  paper.  We 
begin  by  introducing  the  concept  of  a  Bounded  Parameter  MDP  (BMDP),  and 
introducing and justifying BMDP analogues for optimal policies and value
functions. In terms of the theoretical development, we define the basic mathematical
objects, introduce notational conventions, and provide some background in MDPs.
We define the objects and operations that will be useful in the subsequent
theoretical and algorithmic development, e.g., composition operators on MDPs and on
policies.  Finally,  we  define  and  motivate  the  relevant  notions  of  optimality,  and 
then  prove  the  existence  of  optimal  policies  with  respect  to  the  different  notions  of 
optimality. 

In addition to this theoretical and conceptual development, in terms of algorithm
development we describe and provide pseudo-code for algorithms for computing
optimal policies and value functions with respect to the different notions of
optimality, e.g., interval policy evaluation and interval value iteration. We provide
an analysis of the complexity of these algorithms and prove that they compute
optimal policies as defined earlier. We then describe a proof-of-concept implementation
and summarize preliminary experimental results. We also provide a
brief overview of some applications including sensitivity analysis, coping with
parameters known to be imprecise, and support for state aggregation methods.
Finally, we survey some additional related work not covered in the primary text
and summarize our contributions.

Before introducing BMDPs and their algorithms in Section 4 and Section 5, we
first present in the next two sections a brief review of exact MDPs, policy evaluation,
and value iteration in order to establish notational conventions we use
throughout the paper. Our presentation follows that of [14], where a more complete
account may be found.

2.  Exact  Markov  Decision  Processes 

An (exact) Markov decision process $M$ is a four tuple $M = (Q, A, F, R)$ where
$Q$ is a set of states, $A$ is a set of actions, $R$ is a reward function that maps each
state to a real value $R(q)$,¹ and $F$ is a state-transition distribution so that for $a \in A$
and $p, q \in Q$,

$$F_{pq}(a) = \Pr(X_{t+1} = q \mid X_t = p,\ U_t = a) \tag{1}$$

where $X_t$ and $U_t$ are random variables denoting, respectively, the state and action
at time $t$. When needed we write $F^M$ to denote the transition function of the MDP
$M$.

A policy is a mapping from states to actions, $\pi: Q \to A$. The set of all policies
is denoted $\Pi$. An MDP $M$ together with a fixed policy $\pi \in \Pi$ determines a
Markov chain such that the probability of making a transition from $p$ to $q$ is
defined by $F_{pq}(\pi(p))$. The expected value function (or simply the value function)
associated with such a Markov chain is denoted $V_{M,\pi}$. The value function maps
each state to its expected discounted cumulative reward defined by

$$V_{M,\pi}(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, V_{M,\pi}(q) \tag{2}$$

where $0 \le \gamma < 1$ is called the discount rate.² In most contexts, the relevant MDP is
clear and we abbreviate $V_{M,\pi}$ as $V_\pi$.

The optimal value function $V^*_M$ (or simply $V^*$ where the relevant MDP is
clear) is defined as follows.

$$V^*(p) = \max_{a \in A} \Bigl( R(p) + \gamma \sum_{q \in Q} F_{pq}(a)\, V^*(q) \Bigr) \tag{3}$$

The value function $V^*$ is greater than or equal to any value function $V_\pi$ in the partial
order $\ge_{\mathrm{dom}}$ defined as follows: $V_1 \ge_{\mathrm{dom}} V_2$ if and only if for all states $q$,
$V_1(q) \ge V_2(q)$ (in this case we say that $V_1$ dominates $V_2$). We write $V_1 >_{\mathrm{dom}} V_2$
to mean $V_1 \ge_{\mathrm{dom}} V_2$ and $\neg(V_2 \ge_{\mathrm{dom}} V_1)$.

An optimal policy is any policy $\pi^*$ for which $V^* = V_{\pi^*}$. Every MDP has at
least one optimal policy, and the set of optimal policies can be found by replacing
the max in the definition of $V^*$ with argmax.

3.  Estimating  Traditional  Value  Functions 

In this section, we review the basics concerning dynamic programming methods
for computing value functions for fixed and optimal policies in traditional MDPs.
We follow the example of [14]. In Section 5, we describe novel algorithms for
computing the interval analogs of these value functions for bounded-parameter
MDPs.

1. The techniques and results in this paper easily generalize to more general reward functions. We
adopt a less general formulation to simplify the presentation.

2. In this paper, we focus on expected discounted cumulative reward as a performance criterion,
but other criteria, e.g., total or average reward [14], are also applicable to bounded-parameter
MDPs.

We present results from the theory of exact MDPs that rely on the concept of
normed linear spaces. We define operators, $VI_\pi$ and $VI$, on the space of value
functions. We then use the Banach fixed-point theorem (Theorem 1) to show that
iterating these operators converges to unique fixed-points, $V_\pi$ and $V^*$ respectively
(Theorem 3 and Theorem 4).

Let $\mathcal{V}$ denote the set of value functions on $Q$. For each $v \in \mathcal{V}$, define the
(sup) norm of $v$ by

$$\|v\| = \max_{q \in Q} |v(q)|. \tag{4}$$

We use the term convergence to mean convergence in the norm sense. The space
$\mathcal{V}$ together with $\|\cdot\|$ constitute a complete normed linear space, or Banach space.
If $U$ is a Banach space, then an operator $T: U \to U$ is a contraction mapping if
there exists a $\lambda$, $0 \le \lambda < 1$, such that $\|Tv - Tu\| \le \lambda \|v - u\|$ for all $u$ and $v$ in $U$.

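Define $VI: \mathcal{V} \to \mathcal{V}$ and, for each $\pi \in \Pi$, $VI_\pi: \mathcal{V} \to \mathcal{V}$ on each $p \in Q$ by

$$VI(v)(p) = \max_{a \in A} \Bigl( R(p) + \gamma \sum_{q \in Q} F_{pq}(a)\, v(q) \Bigr) \tag{5}$$

$$VI_\pi(v)(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, v(q) \tag{6}$$

In cases where we need to make explicit the MDP from which the transition function
$F$ originates, we write $VI_M$ and $VI_{M,\pi}$ to denote the operators $VI$ and $VI_\pi$
just defined, except that the transition function $F$ is $F^M$. More generally, we write
$VI_{M,\pi}: \mathcal{V} \to \mathcal{V}$ and $VI_{M,a}: \mathcal{V} \to \mathcal{V}$ to denote operators defined on each $p \in Q$ as:

$$VI_{M,\pi}(v)(p) = R(p) + \gamma \sum_{q \in Q} F^M_{pq}(\pi(p))\, v(q) \tag{7}$$

$$VI_{M,a}(v)(p) = R(p) + \gamma \sum_{q \in Q} F^M_{pq}(a)\, v(q) \tag{8}$$

Using these operators, we can rewrite the definitions of $V^*$ and $V_\pi$ as
$V^*(p) = VI(V^*)(p)$ and $V_\pi(p) = VI_\pi(V_\pi)(p)$
for all states $p \in Q$. This implies that $V^*$ and $V_\pi$ are fixed points of $VI$ and $VI_\pi$,
respectively. The following four theorems show that for each operator, iterating the
operator on an initial value estimate converges to these fixed points. Proofs for
these theorems can be found in the work of Puterman [14].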
Theorem 1: For any Banach space $U$ and contraction mapping $T: U \to U$,
there exists a unique $v^*$ in $U$ such that $Tv^* = v^*$; and for arbitrary $v^0$ in $U$,
the sequence $\{v^n\}$ defined by $v^n = Tv^{n-1} = T^n v^0$ converges to $v^*$.

Theorem 2: $VI$ and $VI_\pi$ are contraction mappings.
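
The contraction property of $VI_\pi$ can be checked directly from its definition. The short derivation below is the standard argument, included here only as a sketch (complete proofs are in [14]); it shows that the contraction constant $\lambda$ can be taken to be the discount rate $\gamma$. For any $u, v$ in the space of value functions and any state $p$:

```latex
\begin{align*}
\bigl| VI_\pi(v)(p) - VI_\pi(u)(p) \bigr|
  &= \gamma \,\Bigl| \sum_{q \in Q} F_{pq}(\pi(p)) \bigl( v(q) - u(q) \bigr) \Bigr| \\
  &\le \gamma \sum_{q \in Q} F_{pq}(\pi(p)) \,\bigl| v(q) - u(q) \bigr| \\
  &\le \gamma \,\| v - u \| \sum_{q \in Q} F_{pq}(\pi(p))
   \;=\; \gamma \,\| v - u \| .
\end{align*}
```

Taking the maximum over $p$ on the left-hand side gives $\|VI_\pi v - VI_\pi u\| \le \gamma \|v - u\|$. The same bound holds for $VI$, because the maximum over actions of two families of numbers differs by no more than the largest per-action difference.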

Theorem  1  and  Theorem  2  together  prove  the  following  fundamental  results  in  the 
theory  of  MDPs. 

Theorem 3: There exists a unique $v^* \in \mathcal{V}$ satisfying $v^* = VI(v^*)$; furthermore,
$v^* = V^*$. Similarly, $V_\pi$ is the unique fixed-point of $VI_\pi$.

Theorem 4: For arbitrary $v^0 \in \mathcal{V}$, the sequence $\{v^n\}$ defined by $v^n =
VI(v^{n-1}) = VI^n(v^0)$ converges to $V^*$. Similarly, iterating $VI_\pi$ converges to $V_\pi$.

An important consequence of Theorem 4 is that it provides an algorithm for finding
$V^*$ and $V_\pi$. In particular, to find $V^*$ we can start from an arbitrary initial value
function $v^0$ in $\mathcal{V}$, and repeatedly apply the operator $VI$ to obtain the sequence
$\{v^n\}$. This algorithm is referred to as value iteration. Theorem 4 guarantees the
convergence of value iteration to the optimal value function. Similarly, we can
specify an algorithm called policy evaluation that finds $V_\pi$ by repeatedly applying
$VI_\pi$ starting with an initial $v^0 \in \mathcal{V}$.
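
Theorem 4 translates directly into code. The Python sketch below is illustrative only: the array layout `F[a][p][q]`, the function names, the zero initial estimate, and the stopping tolerance `tol` are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence


def policy_evaluation(F: Sequence[Sequence[Sequence[float]]],
                      R: Sequence[float],
                      policy: Sequence[int],
                      gamma: float,
                      tol: float = 1e-12,
                      max_iter: int = 100_000) -> List[float]:
    """Iterate VI_pi from v0 = 0 until successive estimates differ by < tol.

    F[a][p][q] is the probability of moving from state p to state q under
    action a (an assumed layout); R[p] is the reward of state p.
    """
    n = len(R)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [R[p] + gamma * sum(F[policy[p]][p][q] * v[q] for q in range(n))
                 for p in range(n)]
        converged = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if converged:
            break
    return v


def value_iteration(F: Sequence[Sequence[Sequence[float]]],
                    R: Sequence[float],
                    gamma: float,
                    tol: float = 1e-12,
                    max_iter: int = 100_000) -> List[float]:
    """Iterate VI from v0 = 0; the same loop with a max over actions."""
    n, n_actions = len(R), len(F)
    v = [0.0] * n
    for _ in range(max_iter):
        new_v = [max(R[p] + gamma * sum(F[a][p][q] * v[q] for q in range(n))
                     for a in range(n_actions))
                 for p in range(n)]
        converged = max(abs(x - y) for x, y in zip(new_v, v)) < tol
        v = new_v
        if converged:
            break
    return v


# Hypothetical two-state example: action 0 stays put, action 1 switches state.
F = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [0.0, 1.0]
v_star = value_iteration(F, R, gamma=0.5)                      # close to [1, 2]
v_stay = policy_evaluation(F, R, policy=[0, 0], gamma=0.5)     # close to [0, 2]
```

In this toy problem the optimal policy switches out of the zero-reward state and stays in the rewarding one, so the optimal value function dominates the value of the "always stay" policy.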

The  following  theorem  from  [12]  states  a  convergence  rate  of  value  iteration 
and  policy  evaluation  that  can  be  derived  using  bounds  on  the  precision  needed  to 
represent  solutions  to  a  linear  program  of  limited  precision  (each  algorithm  can  be 
viewed  somewhat  nontrivially  as  solving  a  linear  program). 

Theorem 5: For fixed $\gamma$, value iteration and policy evaluation converge to the
optimal value function in a number of steps polynomial in the number of states,
the number of actions, and the number of bits used to represent the MDP
parameters.

Another important theorem that is used extensively in the proofs of the succeeding
sections results directly from the monotonicity of the $VI_\pi$ operator with
respect to the $\le_{\mathrm{dom}}$ and $\ge_{\mathrm{dom}}$ orderings, together with the above theorems.

Theorem 6: Let $\pi \in \Pi$ be a policy and $M$ an MDP. Suppose there exists
$u \in \mathcal{V}$ for which $u \le_{\mathrm{dom}} (\ge_{\mathrm{dom}})\ VI_{M,\pi}(u)$; then $u \le_{\mathrm{dom}} (\ge_{\mathrm{dom}})\ V_{M,\pi}$. Likewise
for the orderings $<_{\mathrm{dom}}$ and $>_{\mathrm{dom}}$.

4.  Bounded-parameter  Markov  Decision  Processes 

A bounded-parameter MDP (BMDP) is a four tuple $M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ where
$Q$ and $A$ are defined as for MDPs, and $F_\updownarrow$ and $R_\updownarrow$ are analogous to the MDP $F$
and $R$ but yield closed real intervals instead of real values. That is, for any action
$a$ and states $p, q$, $R_\updownarrow(p)$ and $F_{\updownarrow p,q}(a)$ are both closed real intervals of the form
$[l, u]$ for real numbers $l$ and $u$ with $l \le u$, where in the case of $F_\updownarrow$ we require
$0 \le l \le u \le 1$.³ To ensure that $F_\updownarrow$ admits only well-formed transition functions, we
require that for any action $a$ and state $p$, the sum of the lower bounds of $F_{\updownarrow p,q}(a)$
over all states $q$ must be less than or equal to 1 while the upper bounds must sum
to a value greater than or equal to 1. Figure 1 depicts the state-transition diagram
for a simple BMDP with three states and one action. We use a one-action BMDP to
illustrate various concepts in this paper because multi-action systems are awkward
to draw, and one action suffices to illustrate the concepts. Note that a one-action
BMDP or MDP has only one policy available (select the only action at all states),
and so represents a trivial control problem.

A BMDP $M_\updownarrow = (Q, A, F_\updownarrow, R_\updownarrow)$ defines a set of exact MDPs that, by abuse of
notation, we also call $M_\updownarrow$. For any exact MDP $M = (Q', A', F', R')$, we have
$M \in M_\updownarrow$ if $Q = Q'$, $A = A'$, and for any action $a$ and states $p, q$, $R'(p)$ is in
the interval $R_\updownarrow(p)$ and $F'_{p,q}(a)$ is in the interval $F_{\updownarrow p,q}(a)$. We rely on context to
distinguish between the tuple view of $M_\updownarrow$ and the set of exact MDPs view of $M_\updownarrow$.
In the remaining definitions in this section, the BMDP $M_\updownarrow$ is implicit. Figure 3
shows an example of an exact MDP belonging to the family described by the
BMDP in Figure 1. We use the convention that thick wavy lines represent interval
valued transition probabilities and thinner straight lines represent exact transition
probabilities.
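
The two conditions above (well-formed interval bounds, and membership of an exact MDP in the set defined by a BMDP) are easy to state as checks. The following Python sketch is only an illustration: the nested-list layout `F_bounds[a][p][q] = (lower, upper)`, the function names, and the tolerance `eps` are assumptions of this example. Rewards are left out, in line with the simplifying assumption that reward bounds are tight.

```python
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)
Bounds = Sequence[Sequence[Sequence[Interval]]]  # F_bounds[a][p][q]


def is_well_formed(F_bounds: Bounds, eps: float = 1e-9) -> bool:
    """Check 0 <= l <= u <= 1 per entry, sum of lowers <= 1 <= sum of uppers per row."""
    for per_action in F_bounds:
        for row in per_action:
            if any(not (0.0 <= lo <= hi <= 1.0) for lo, hi in row):
                return False
            if sum(lo for lo, _ in row) > 1.0 + eps:
                return False
            if sum(hi for _, hi in row) < 1.0 - eps:
                return False
    return True


def contains(F_bounds: Bounds,
             F: Sequence[Sequence[Sequence[float]]],
             eps: float = 1e-9) -> bool:
    """True if the exact transition function F[a][p][q] lies inside the bounds
    and every row of F is a probability distribution."""
    for a, per_action in enumerate(F_bounds):
        for p, row in enumerate(per_action):
            if abs(sum(F[a][p]) - 1.0) > eps:
                return False
            for q, (lo, hi) in enumerate(row):
                if not (lo - eps <= F[a][p][q] <= hi + eps):
                    return False
    return True


# Hypothetical one-action, two-state BMDP (not the BMDP of Figure 1).
F_bounds = [[[(0.2, 0.8), (0.2, 0.8)],
             [(0.0, 0.5), (0.5, 1.0)]]]
inside = contains(F_bounds, [[[0.5, 0.5], [0.25, 0.75]]])    # within every interval
outside = contains(F_bounds, [[[0.9, 0.1], [0.25, 0.75]]])   # 0.9 exceeds the upper bound 0.8
```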


3. To simplify the remainder of the paper, we assume that the reward bounds are always tight, i.e.,
that for all $q \in Q$, for some real $l$, $R_\updownarrow(q) = [l, l]$, and we refer to $l$ as $R(q)$. The generalization
of our results to nontrivial bounds on rewards is straightforward.

An interval value function $V_\updownarrow$ is a mapping from states to closed real intervals.
We generally use such functions to indicate that the value of a given state falls
within the selected interval. Interval value functions can be specified for both exact
MDPs and BMDPs. As in the case of (exact) value functions, interval value functions
are specified with respect to a fixed policy. Note that in the case of BMDPs a
state can have a range of values depending on how the transition and reward
parameters are instantiated, hence the need for an interval value function.

For each interval valued function (e.g., $F_\updownarrow$, $R_\updownarrow$, $V_\updownarrow$, and those we define later)
we define two real valued functions that take the same arguments and return the
upper and lower interval bounds, respectively, denoted by the following syntactic
variations: $F_\uparrow$, $R_\uparrow$, $V_\uparrow$ for upper bounds, and $F_\downarrow$, $R_\downarrow$, $V_\downarrow$ for lower bounds,
respectively. So, for example, at any state $q$ we have $V_\updownarrow(q) = [V_\downarrow(q), V_\uparrow(q)]$.

We note that the number of MDPs $M \in M_\updownarrow$ is in general uncountable. We start
our analysis by showing that there is a finite subset $X_{M_\updownarrow} \subseteq M_\updownarrow$ of these MDPs of
particular interest. Given any ordering $O$ of all the states in $Q$, there is a unique
MDP $M \in M_\updownarrow$ that minimizes, for every state $q$ and action $a$, the expected "position
in the ordering" of the state reached by taking action $a$ in state $q$. In other
words, it is an MDP that for every state $q$ and action $a$ sends as much probability mass
as possible to states early in the ordering $O$ when taking action $a$ in state $q$. Formally,
we define the following concept:

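Definition 1. Let $O = q_1, q_2, \ldots, q_k$ be an ordering of $Q$. We define the
order-maximizing MDP $M_O$ with respect to ordering $O$ as follows.

For a given state $p$ and action $a$, let $r$ be the index $1 \le r \le k$ that maximizes the
following expression without letting it exceed 1:

$$\sum_{i=1}^{r-1} F_{\uparrow p,q_i}(a) \;+\; \sum_{i=r}^{k} F_{\downarrow p,q_i}(a)$$

The value $r$ is the index into the state ordering $\{q_i\}$ such that below index $r$
we assign the upper bound, and above index $r$ we assign the lower bound, with
the rest of the probability mass from $p$ under $a$ being assigned to $q_r$. Formally,
we select $M_O \in M_\updownarrow$ by choosing $F^{M_O}_{p,q_j}(a)$ for all $q_j \in Q$ as follows:

$$F^{M_O}_{p,q_j}(a) =
\begin{cases}
F_{\uparrow p,q_j}(a) & \text{if } j < r \\
F_{\downarrow p,q_j}(a) & \text{if } j > r
\end{cases}
\qquad\text{and}\qquad
F^{M_O}_{p,q_r}(a) = 1 - \sum_{i=1,\ i \ne r}^{k} F^{M_O}_{p,q_i}(a).$$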

Figure 2: An illustration of the transition probabilities in the order-maximizing
MDP at the state p for the order shown. The lighter shaded portions of each arc
represent the required lower bound transition probability and the darker shaded
portions represent the fraction of the remaining allowed transition probability
assigned to the arc by T.


Figure  2  shows  a  diagrammatic  representation  of  the  order-maximizing  MDP  at  a 
particular  state  p  for  the  particular  ordering  of  the  state  space  shown.  Figure  3 
shows  the  order-maximizing  MDP  for  the  particular  BMDP  shown  in  Figure  1 
using  a  particular  state  order  (2  >  3  >  1),  as  a  concrete  example. 
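
The construction in Definition 1 can be sketched for a single state-action pair as a greedy pass over the ordering. The Python below is an illustration under stated assumptions: the name `order_maximizing_row`, the list-of-pairs layout for the interval bounds, and the greedy formulation itself (start every state at its lower bound, then hand the leftover mass to states in order, up to each upper bound) are choices made for this example. For well-formed bounds the greedy pass yields the same distribution as the index-$r$ formulation: states before the cut-off end at their upper bound, states after it stay at their lower bound, and the cut-off state receives the remainder.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def order_maximizing_row(bounds: Sequence[Interval],
                         order: Sequence[int]) -> List[float]:
    """Transition probabilities out of one (state, action) pair that send as
    much mass as possible to states early in `order`.

    bounds[q] is the interval allowed for the transition to state q;
    order lists state indices, earliest (most preferred) first.
    Assumes the bounds are well formed (lowers sum to <= 1 <= uppers).
    """
    probs = [lo for lo, _ in bounds]      # every state gets at least its lower bound
    remaining = 1.0 - sum(probs)          # mass still to be distributed
    for q in order:
        lo, hi = bounds[q]
        extra = min(hi - lo, remaining)   # raise q toward its upper bound
        probs[q] += extra
        remaining -= extra
    return probs


# Hypothetical bounds for three successor states, with state 2 ordered first.
bounds = [(0.1, 0.4), (0.2, 0.6), (0.3, 0.5)]
row = order_maximizing_row(bounds, order=[2, 0, 1])
# State 2 reaches its upper bound 0.5, state 1 stays at its lower bound 0.2,
# and state 0 (the cut-off state) receives the remaining 0.3.
```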

Definition 2. Let $X_{M_\updownarrow}$ be the set of order-maximizing MDPs $M_O$ in $M_\updownarrow$, one
for each ordering $O$. Note that since there are finitely many orderings of states,
$X_{M_\updownarrow}$ is finite.

We now show that the set $X_{M_\updownarrow}$ in some sense contains every MDP of interest from
$M_\updownarrow$. In particular, we show that for any policy $\pi$ and any MDP $M$ in $M_\updownarrow$, the value
of $\pi$ in $M$ is bracketed by values of $\pi$ in two MDPs in $X_{M_\updownarrow}$.

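Lemma 1: For any MDP $M \in M_\updownarrow$,

(a) For any policy $\pi \in \Pi$, there are MDPs $M_1 \in X_{M_\updownarrow}$ and $M_2 \in X_{M_\updownarrow}$ such
that

$$V_{M_1,\pi} \le_{\mathrm{dom}} V_{M,\pi} \le_{\mathrm{dom}} V_{M_2,\pi}.$$

(b) Also, for any value function $v \in \mathcal{V}$, there are MDPs $M_3 \in X_{M_\updownarrow}$ and
$M_4 \in X_{M_\updownarrow}$ such that

$$VI_{M_3,\pi}(v) \le_{\mathrm{dom}} VI_{M,\pi}(v) \le_{\mathrm{dom}} VI_{M_4,\pi}(v).$$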

Proof:  See  Appendix. 

Interval Value Functions for Policies. We now define the interval analogue to the
traditional MDP policy-specific value function $V_\pi$, and state and prove some of
the properties of this interval value function. The development here requires some
care, as one desired property of the definition is not immediate. We first observe
that we would like an interval-valued function over the state space that satisfies a
Bellman equation like that for traditional MDPs (as given by Equation 2). Unfortunately,
stating a Bellman equation requires us to have specific transition probability
distributions $F$ rather than a range of such distributions. Instead of defining
policy value via a Bellman equation, we define the interval value function directly,
at each state, as giving the range of values that could be attained at that state for the
various choices of $F$ allowed by the BMDP. We then show that the desired minimum
and maximum values can be achieved independent of the state, so that the
upper and lower bound value functions are just the values of the policy in particular
"minimizing" and "maximizing" MDPs in the BMDP. This fact enables the use
of the Bellman equations for the minimizing and maximizing MDPs to give an
iterative algorithm that converges to the desired values, as presented in Section 5.

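Definition 3. For any policy $\pi$ and state $q$, we define the interval value $V_{\updownarrow\pi}(q)$
of $\pi$ at $q$ to be the interval

$$V_{\updownarrow\pi}(q) = \Bigl[\, \min_{M \in M_\updownarrow} V_{M,\pi}(q),\ \max_{M \in M_\updownarrow} V_{M,\pi}(q) \,\Bigr]. \tag{12}$$

We note that the existence of these minimum and maximum values follows
from Lemma 1 and the finiteness of the set $X_{M_\updownarrow}$, because Lemma 1 implies
that $V_{\updownarrow\pi}(q)$ is the same as the following, where the minimization and maximization
are done over finite sets:

$$V_{\updownarrow\pi}(q) = \Bigl[\, \min_{M \in X_{M_\updownarrow}} V_{M,\pi}(q),\ \max_{M \in X_{M_\updownarrow}} V_{M,\pi}(q) \,\Bigr]. \tag{13}$$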

In preparation for the discussion in Section 5, we show in Theorem 7 that for any
policy there is at least one specific policy-maximizing MDP in $M_\updownarrow$ that achieves the
upper bound in Definition 3 at all states $q$ simultaneously (and likewise a different
specific policy-minimizing MDP that achieves the lower bound at all states $q$
simultaneously). We formally define these terms below.

Definition 4. For any policy $\pi$, an MDP $M \in M_\updownarrow$ is $\pi$-maximizing if $V_{M,\pi}$
dominates $V_{M',\pi}$ for any $M' \in M_\updownarrow$, i.e., for any $M' \in M_\updownarrow$, $V_{M,\pi} \ge_{\mathrm{dom}} V_{M',\pi}$.
Likewise, $M \in M_\updownarrow$ is $\pi$-minimizing if it is dominated by all such $V_{M',\pi}$, i.e.,
for any $M' \in M_\updownarrow$, $V_{M,\pi} \le_{\mathrm{dom}} V_{M',\pi}$.

Figure 4 shows the interval value function for the only policy available in the (trivial)
one-action BMDP shown in Figure 1, along with the $\pi$-maximizing and $\pi$-minimizing
MDPs for that policy.

We note that Lemma 1 implies that for any single state $q$ and any policy $\pi$ we can
select an MDP $M \in M_\updownarrow$ to maximize (or minimize) $V_{M,\pi}(q)$ by selecting the
MDP in $X_{M_\updownarrow}$ that gives the largest (or smallest) value for $\pi$ at $q$. However, we have
not shown that a single MDP can be chosen to simultaneously maximize (or minimize)
$V_{M,\pi}(q)$ at all states $q \in Q$ (i.e., that there exist $\pi$-maximizing and $\pi$-minimizing
MDPs). In order to show this fact, we show how to compose two MDPs (with
respect to a fixed policy $\pi$) to construct a third MDP such that the value of $\pi$ in the
third MDP is not less than the value of $\pi$ in either of the initial two MDPs, at every
state. We can then construct a $\pi$-maximizing MDP by composing together all the
MDPs that maximize the value of $\pi$ at the different individual states (likewise for
$\pi$-minimizing MDPs using a similar composition operator). We start by defining
the just mentioned policy-relative composition operators on MDPs:

Definition 5. Let $\oplus^\pi_{\max}$ and $\oplus^\pi_{\min}$ denote composition operators on MDPs with
respect to a policy $\pi \in \Pi$, defined as follows:

[Figure 4 graphic. Top: the BMDP of Figure 1 annotated with interval values
V↕ = [80.1, 85.2] (state with reward 9), V↕ = [66.8, 76.7] (state with reward 1), and
V↕ = [70.1, 79.8] (state with reward 10); arc labels surviving extraction: [0.7, 0.8],
[0.7, 1.0], [0.2, 0.5], [0.89, 1.0], [0.1, 0.15], [0.0, 0.1]. Lower left: π-minimizing MDP
with V = 80.1, 66.8, 70.1 for the states with rewards 9, 1, 10. Lower right:
π-maximizing MDP with V = 85.2, 76.7, 79.8 for the same states.]


Figure 4: The interval value function (shown as $V_\updownarrow$ on the top subfigure),
policy-minimizing MDP with state values (lower left), and policy-maximizing
MDP with state values (lower right) for the one-action BMDP shown in Figure 1
under the only policy. We assume a discount factor of 0.9. Note that the lower-bound
values in the interval value function are the state values under the policy-minimizing
MDP, and the upper-bound values are the state values under the
policy-maximizing MDP. Also, note that the policy-maximizing MDP is the
order-maximizing MDP for the state order 3>2>1 and the policy-minimizing
MDP is the order-maximizing MDP for the order 1>2>3; policy-minimizing
and maximizing MDPs are always order-maximizing for some order (but the
orders need not be reverse to one another as they are in this example).


Bounded-parameter  Markov  Decision  Processes,  June  16,  2000 




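[Figure 5 graphic. Three panels, left to right: MDP $M_1$ with V = 84.8 (reward 9)
and V = 79.4 (reward 10); MDP $M_2$ with V = 85.0 (reward 9) and V = 79.3
(reward 10); the composition $M_1 \oplus^\pi_{\max} M_2$ with V = 85.2 (reward 9) and
V = 79.8 (reward 10). Only these state values survive in the extracted graphic.]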

Figure 5: Two MDPs $M_1$ and $M_2$ from the BMDP shown in Figure 1, and their
composition under $\oplus^\pi_{\max}$ where $\pi$ is the only available policy in the one-action
BMDP. State transition probabilities for the composition MDP are selected from
the component MDP that achieves the greater value for the source state of the
transition. State values are shown for all three MDPs; note that the
composition MDP achieves higher value at every state, as claimed in Lemma 2.


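If $M_1, M_2 \in M_\updownarrow$, then $M_3 = M_1 \oplus^\pi_{\max} M_2$ if for all states $p, q \in Q$,

$$F^{M_3}_{pq}(a) =
\begin{cases}
F^{M_1}_{pq}(a) & \text{if } V_{M_1,\pi}(p) \ge V_{M_2,\pi}(p) \text{ and } a = \pi(p) \\
F^{M_2}_{pq}(a) & \text{otherwise}
\end{cases}$$

If $M_1, M_2 \in M_\updownarrow$, then $M_3 = M_1 \oplus^\pi_{\min} M_2$ if for all states $p, q \in Q$,

$$F^{M_3}_{pq}(a) =
\begin{cases}
F^{M_1}_{pq}(a) & \text{if } V_{M_1,\pi}(p) \le V_{M_2,\pi}(p) \text{ and } a = \pi(p) \\
F^{M_2}_{pq}(a) & \text{otherwise}
\end{cases}$$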
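
The max-composition of Definition 5 can be sketched in a few lines. The Python below is illustrative only: the layout `F[a][p][q]`, the helper `evaluate` (a fixed number of policy-evaluation sweeps rather than a convergence test), the function names, and the two small example MDPs are all assumptions of this sketch and are not taken from Figure 5.

```python
from typing import List, Sequence

Transition = List[List[List[float]]]  # F[a][p][q], an assumed layout


def evaluate(F: Transition, R: Sequence[float], policy: Sequence[int],
             gamma: float, sweeps: int = 2000) -> List[float]:
    """Approximate V_{M,pi} by repeatedly applying VI_{M,pi} to v0 = 0."""
    n = len(R)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [R[p] + gamma * sum(F[policy[p]][p][q] * v[q] for q in range(n))
             for p in range(n)]
    return v


def compose_max(F1: Transition, F2: Transition, R: Sequence[float],
                policy: Sequence[int], gamma: float) -> Transition:
    """Max-composition of two MDPs w.r.t. `policy`: at each source state p,
    take the row for action policy[p] from whichever MDP gives p the higher
    value (ties go to the first MDP); all other rows come from the second."""
    v1 = evaluate(F1, R, policy, gamma)
    v2 = evaluate(F2, R, policy, gamma)
    F3 = [[row[:] for row in per_action] for per_action in F2]  # default: M2
    for p in range(len(R)):
        if v1[p] >= v2[p]:
            a = policy[p]
            F3[a][p] = F1[a][p][:]
    return F3


# Two hypothetical one-action MDPs over the same two states.
R = [0.0, 1.0]
policy = [0, 0]
F1 = [[[0.1, 0.9], [0.5, 0.5]]]
F2 = [[[0.9, 0.1], [0.1, 0.9]]]
F3 = compose_max(F1, F2, R, policy, gamma=0.9)
v1, v2, v3 = (evaluate(F, R, policy, 0.9) for F in (F1, F2, F3))
```

In this example the first MDP is better at state 0 and the second at state 1, so the composition takes one row from each, and its value function is at least as large as both component value functions at every state.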

We give as an example in Figure 5 two MDPs from the BMDP of Figure 1, along
with their composition under the $\oplus^\pi_{\max}$ operator where $\pi$ is the single available
policy for that one-action BMDP. We now state the property claimed above for this
MDP composition operator:

Lemma 2: Let $\pi$ be a policy in $\Pi$ and $M_1, M_2$ be MDPs in $M_\updownarrow$.

(a) For $M_3 = M_1 \oplus^\pi_{\max} M_2$,

$$V_{M_3,\pi} \ge_{\mathrm{dom}} V_{M_1,\pi} \quad\text{and}\quad V_{M_3,\pi} \ge_{\mathrm{dom}} V_{M_2,\pi}, \text{ and} \tag{14}$$

(b) for $M_3 = M_1 \oplus^\pi_{\min} M_2$,

$$V_{M_3,\pi} \le_{\mathrm{dom}} V_{M_1,\pi} \quad\text{and}\quad V_{M_3,\pi} \le_{\mathrm{dom}} V_{M_2,\pi}. \tag{15}$$

Proof: See Appendix.

These MDP composition operators can now be used to show the existence of policy-maximizing
and policy-minimizing MDPs within $M_\updownarrow$.

Theorem 7: For any policy $\pi \in \Pi$, there exist $\pi$-maximizing and $\pi$-minimizing
MDPs in $M_\updownarrow$.

Proof: Enumerate $X_{M_\updownarrow}$ as a finite sequence of MDPs $M_1, \ldots, M_k$. Consider
composing these MDPs together to construct the MDP $M$ as follows:

$$M = (((M_1 \oplus^\pi_{\max} M_2) \oplus^\pi_{\max} \cdots) \oplus^\pi_{\max} M_k) \tag{16}$$

Note that $M$ may depend on the ordering of $M_1, \ldots, M_k$, but that any ordering
is satisfactory for this proof. It is straightforward to show by induction using
Lemma 2 that $V_{M,\pi} \ge_{\mathrm{dom}} V_{M_i,\pi}$ for each $1 \le i \le k$, and then Lemma 1 implies
that $V_{M,\pi} \ge_{\mathrm{dom}} V_{M',\pi}$ for any $M' \in M_\updownarrow$. $M$ is thus a $\pi$-maximizing MDP.

Although $M$ may not be in $X_{M_\updownarrow}$, Lemma 1 implies that $V_{M,\pi}$ must be dominated
by $V_{M',\pi}$ for some $M' \in X_{M_\updownarrow}$, which must also be $\pi$-maximizing.

An identical proof implies the existence of $\pi$-minimizing MDPs, replacing
each occurrence of "max" with "min" and each $\ge_{\mathrm{dom}}$ with $\le_{\mathrm{dom}}$. □

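Corollary 1: $V_{\downarrow\pi} = \min_{M \in M_\updownarrow} V_{M,\pi}$ and $V_{\uparrow\pi} = \max_{M \in M_\updownarrow} V_{M,\pi}$, where the
minimum and maximum are computed relative to $\le_{\mathrm{dom}}$ and are well-defined by
Theorem 7.

We give an algorithm in Section 5 that converges to $V_{\downarrow\pi}$ by also converging to a
$\pi$-minimizing MDP in $M_\updownarrow$ (similarly for $V_{\uparrow\pi}$, exchanging $\pi$-maximizing for
$\pi$-minimizing).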
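
As a rough preview of how such bounds can be computed, the Python sketch below iterates a lower-bound and an upper-bound estimate for a fixed policy. It is an illustration under stated assumptions, not the paper's pseudo-code: it assumes that at each sweep the extreme transition row can be taken from the order-maximizing MDP for the states sorted by their current estimate (ascending for the lower bound, descending for the upper bound), and the names `extreme_row` and `interval_policy_evaluation`, the layout `F_bounds[a][p][q] = (lower, upper)`, and the fixed sweep count are all choices made for this example.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def extreme_row(bounds: Sequence[Interval], order: Sequence[int]) -> List[float]:
    """Distribution within `bounds` that puts as much mass as possible on
    states early in `order` (greedy form of the order-maximizing choice)."""
    probs = [lo for lo, _ in bounds]
    remaining = 1.0 - sum(probs)
    for q in order:
        extra = min(bounds[q][1] - bounds[q][0], remaining)
        probs[q] += extra
        remaining -= extra
    return probs


def interval_policy_evaluation(F_bounds, R: Sequence[float],
                               policy: Sequence[int], gamma: float,
                               sweeps: int = 500) -> List[Interval]:
    """Return [(lower, upper)] value bounds of `policy`, one pair per state."""
    n = len(R)
    lower, upper = [0.0] * n, [0.0] * n
    for _ in range(sweeps):
        asc = sorted(range(n), key=lambda q: lower[q])    # low-value states first
        desc = sorted(range(n), key=lambda q: -upper[q])  # high-value states first
        new_lower, new_upper = [], []
        for p in range(n):
            row = F_bounds[policy[p]][p]
            f_min = extreme_row(row, asc)    # pushes mass toward low values
            f_max = extreme_row(row, desc)   # pushes mass toward high values
            new_lower.append(R[p] + gamma * sum(f_min[q] * lower[q] for q in range(n)))
            new_upper.append(R[p] + gamma * sum(f_max[q] * upper[q] for q in range(n)))
        lower, upper = new_lower, new_upper
    return list(zip(lower, upper))


# Hypothetical one-action, two-state BMDP with rewards 0 and 1.
F_bounds = [[[(0.2, 0.8), (0.2, 0.8)],
             [(0.0, 0.5), (0.5, 1.0)]]]
intervals = interval_policy_evaluation(F_bounds, R=[0.0, 1.0], policy=[0, 0], gamma=0.5)
# Solving the two extreme MDPs by hand gives [4/17, 8/9] for state 0
# and [24/17, 2] for state 1.
```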

Optimal Value Functions in BMDPs. We now consider how to define an optimal
value function for a BMDP. First, consider the expression $\max_{\pi \in \Pi}(V_{\updownarrow\pi})$. This
expression is ill-formed because we have not defined how to rank the interval value
functions $V_{\updownarrow\pi}$ in order to select a maximum.⁴ We focus here on two different
ways to order these value functions, yielding two notions of optimal value function
and optimal policy. Other orderings may also yield interesting results.

First, we define two different orderings on closed real intervals:

$$\begin{aligned}
([l_1, u_1] \le_{\mathrm{pes}} [l_2, u_2]) &\iff (l_1 < l_2 \ \text{or}\ (l_1 = l_2 \wedge u_1 \le u_2)) \\
([l_1, u_1] \le_{\mathrm{opt}} [l_2, u_2]) &\iff (u_1 < u_2 \ \text{or}\ (u_1 = u_2 \wedge l_1 \le l_2))
\end{aligned} \tag{17}$$

4. Similar issues arise if we attempt to define the optimal value function using a Bellman style
equation such as Equation 3 because we must compute a maximization over a set of intervals.


We extend these orderings to partial orders over interval value functions by relating
two value functions $V_{\updownarrow 1} \le_{\mathrm{opt}} V_{\updownarrow 2}$ only when $V_{\updownarrow 1}(q) \le_{\mathrm{opt}} V_{\updownarrow 2}(q)$ for every state
$q$ (and likewise for $\le_{\mathrm{pes}}$). We can now use either of these orderings to compute
$\max_{\pi \in \Pi}(V_{\updownarrow\pi})$, yielding
two definitions of optimal value function and optimal policy. However, since the
orderings are partial (on value functions), we prove first (Theorem 8) that the set of
policies contains a policy that achieves the desired maximum under each ordering
(i.e., a policy whose interval value function is ordered above that of every other
policy).
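
Both interval orderings are lexicographic comparisons, which the following Python sketch makes explicit. The function names and the representation of an interval as a `(lower, upper)` pair are assumptions of this example.

```python
from typing import Callable, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def leq_pes(x: Interval, y: Interval) -> bool:
    """x <=_pes y: compare lower bounds first, break ties on upper bounds."""
    (l1, u1), (l2, u2) = x, y
    return l1 < l2 or (l1 == l2 and u1 <= u2)


def leq_opt(x: Interval, y: Interval) -> bool:
    """x <=_opt y: compare upper bounds first, break ties on lower bounds."""
    (l1, u1), (l2, u2) = x, y
    return u1 < u2 or (u1 == u2 and l1 <= l2)


def vf_leq(V1: Sequence[Interval], V2: Sequence[Interval],
           leq: Callable[[Interval, Interval], bool]) -> bool:
    """Statewise extension to interval value functions (a partial order)."""
    return all(leq(a, b) for a, b in zip(V1, V2))


# The two orderings can disagree on the same pair of intervals:
a, b = (1.0, 5.0), (2.0, 3.0)
pes_says = leq_pes(a, b)   # True: a has the smaller lower bound
opt_says = leq_opt(a, b)   # False: a has the larger upper bound

# On value functions the order is only partial: neither function below is
# ordered above the other under <=_opt.
V1 = [(0.0, 1.0), (2.0, 3.0)]
V2 = [(1.0, 2.0), (0.0, 1.0)]
incomparable = not vf_leq(V1, V2, leq_opt) and not vf_leq(V2, V1, leq_opt)
```

The incomparable pair at the end is why the existence of a maximal policy has to be proved rather than assumed.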


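Definition 6. An optimistically optimal policy $\pi_{\mathrm{opt}}$ is any policy such that
$V_{\updownarrow\pi_{\mathrm{opt}}} \ge_{\mathrm{opt}} V_{\updownarrow\pi}$ for all policies $\pi$. A pessimistically optimal policy $\pi_{\mathrm{pes}}$ is any policy
such that $V_{\updownarrow\pi_{\mathrm{pes}}} \ge_{\mathrm{pes}} V_{\updownarrow\pi}$ for all policies $\pi$.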

In Theorem 8, we prove that there exist optimistically optimal policies by
induction (an analogous proof holds for pessimistically optimal policies). We
develop this proof in two stages, mirroring the two-stage definition of $\ge_{\mathrm{opt}}$ (first
emphasizing the upper bound and then breaking ties with the lower bound). We
first construct a policy $\pi_{\mathrm{opt,up}}$ for which the upper bounds $V_{\uparrow\pi_{\mathrm{opt,up}}}$ of the interval value function
dominate those of any other policy. We then show that the finite set
of such policies (all tied on upper bounds) can be combined to construct a policy
$\pi_{\mathrm{opt}}$ with the same upper bound values $V_{\uparrow\pi_{\mathrm{opt}}}$ and whose lower bounds $V_{\downarrow\pi_{\mathrm{opt}}}$ dominate
those of any other policy. Each of these constructions relies on the following
policy composition operator:

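Definition 7. Let $\oplus_{\mathrm{opt}}$ and $\oplus_{\mathrm{pes}}$ denote composition operators on policies,
defined as follows. Consider policies $\pi_1, \pi_2 \in \Pi$.

Let $\pi_3 = \pi_1 \oplus_{\mathrm{opt}} \pi_2$ if for all states $p \in Q$:

$$\pi_3(p) =
\begin{cases}
\pi_1(p) & \text{if } V_{\updownarrow\pi_1}(p) \ge_{\mathrm{opt}} V_{\updownarrow\pi_2}(p) \\
\pi_2(p) & \text{otherwise}
\end{cases} \tag{18}$$

Let $\pi_3 = \pi_1 \oplus_{\mathrm{pes}} \pi_2$ if for all states $p \in Q$:

$$\pi_3(p) =
\begin{cases}
\pi_1(p) & \text{if } V_{\updownarrow\pi_1}(p) \ge_{\mathrm{pes}} V_{\updownarrow\pi_2}(p) \\
\pi_2(p) & \text{otherwise}
\end{cases} \tag{19}$$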

Our task would be relatively easy if it were necessarily true that

$$V_{\updownarrow(\pi_1 \oplus_{\mathrm{opt}} \pi_2)} \ge_{\mathrm{opt}} V_{\updownarrow\pi_1} \quad\text{and}\quad V_{\updownarrow(\pi_1 \oplus_{\mathrm{opt}} \pi_2)} \ge_{\mathrm{opt}} V_{\updownarrow\pi_2}$$

(and likewise for the pessimistic case). However, because of the lexicographic
nature of $\ge_{\mathrm{opt}}$, these statements do not hold (in particular, the lower bound values
for some states may be worse in the composed policy than in either component
even when the upper bounds on those states do not change). For this reason, we
prove a somewhat weaker result that must be used in a two-stage fashion as demonstrated
below:

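Lemma 3: Given a BMDP $M_\updownarrow$, and policies $\pi_1, \pi_2 \in \Pi$, $\pi_3 = \pi_1 \oplus_{\mathrm{opt}} \pi_2$,
and $\pi_4 = \pi_1 \oplus_{\mathrm{pes}} \pi_2$,

(a) $V_{\uparrow\pi_3} \ge_{\mathrm{dom}} V_{\uparrow\pi_1}$ and $V_{\uparrow\pi_3} \ge_{\mathrm{dom}} V_{\uparrow\pi_2}$.

(b) If $V_{\uparrow\pi_1} = V_{\uparrow\pi_2}$ then $V_{\updownarrow\pi_3} \ge_{\mathrm{opt}} V_{\updownarrow\pi_1}$ and $V_{\updownarrow\pi_3} \ge_{\mathrm{opt}} V_{\updownarrow\pi_2}$.

(c) $V_{\downarrow\pi_4} \ge_{\mathrm{dom}} V_{\downarrow\pi_1}$ and $V_{\downarrow\pi_4} \ge_{\mathrm{dom}} V_{\downarrow\pi_2}$.

(d) If $V_{\downarrow\pi_1} = V_{\downarrow\pi_2}$ then $V_{\updownarrow\pi_4} \ge_{\mathrm{pes}} V_{\updownarrow\pi_1}$ and $V_{\updownarrow\pi_4} \ge_{\mathrm{pes}} V_{\updownarrow\pi_2}$.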

Proof:  See  Appendix. 

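Theorem 8: There exists at least one optimistically (pessimistically) optimal
policy.

Proof: Enumerate $\Pi$ as a finite sequence of policies $\pi_1, \ldots, \pi_k$. Consider
composing these policies together to construct the policy $\pi_{\mathrm{opt,up}}$ as follows:

$$\pi_{\mathrm{opt,up}} = (((\pi_1 \oplus_{\mathrm{opt}} \pi_2) \oplus_{\mathrm{opt}} \cdots) \oplus_{\mathrm{opt}} \pi_k)$$

Note that $\pi_{\mathrm{opt,up}}$ may depend on the ordering of $\pi_1, \ldots, \pi_k$, but that any ordering
is satisfactory for this proof. It is straightforward to show by induction using
Lemma 3 that $V_{\uparrow\pi_{\mathrm{opt,up}}} \ge_{\mathrm{dom}} V_{\uparrow\pi_i}$ for each $1 \le i \le k$. Now enumerate the subset of
$\Pi$ for which the value function upper bounds equal those of $\pi_{\mathrm{opt,up}}$, i.e., enumerate
$\{\pi' \mid V_{\uparrow\pi'} = V_{\uparrow\pi_{\mathrm{opt,up}}}\}$ as $\{\pi'_1, \ldots, \pi'_l\}$. Consider again composing the
policies $\pi'_i$ together as above to form the policy $\pi_{\mathrm{opt}}$:

$$\pi_{\mathrm{opt}} = (((\pi'_1 \oplus_{\mathrm{opt}} \pi'_2) \oplus_{\mathrm{opt}} \cdots) \oplus_{\mathrm{opt}} \pi'_l) \tag{22}$$

It is again straightforward to show using Lemma 3 that $V_{\downarrow\pi_{\mathrm{opt}}} \ge_{\mathrm{dom}} V_{\downarrow\pi'_i}$ for each
$1 \le i \le l$. It follows immediately that $V_{\updownarrow\pi_{\mathrm{opt}}} \ge_{\mathrm{opt}} V_{\updownarrow\pi}$ for every $\pi \in \Pi$, as
desired. A similar construction using $\oplus_{\mathrm{pes}}$ yields a pessimistically optimal policy
$\pi_{\mathrm{pes}}$. □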

Theorem  8  justifies  the  following  definition: 
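
Definition 8. The optimistic optimal value function $V_{\updownarrow\mathrm{opt}}$ and the pessimistic
optimal value function $V_{\updownarrow\mathrm{pes}}$ are given by:

$V_{\updownarrow\mathrm{opt}} = \max_{\pi \in \Pi} V_{\updownarrow\pi}$ using $\le_{\mathrm{opt}}$ to order interval value functions

$V_{\updownarrow\mathrm{pes}} = \max_{\pi \in \Pi} V_{\updownarrow\pi}$ using $\le_{\mathrm{pes}}$ to order interval value functions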

The above two notions of optimal value can be understood in terms of a two player
game in which the first player chooses a policy $\pi$ and then the second player
chooses the MDP $M$ in $M_\updownarrow$ in which to evaluate the policy $\pi$ (see Shapley's work
[16] for the origins of this viewpoint). The goal for the first player is to get the
highest⁵ resulting value function $V_{M,\pi}$. The upper bounds $V_{\uparrow\mathrm{opt}}$ of the optimistically
optimal value function represent the best value function the first player can
obtain in this game if the second player cooperates by selecting an MDP to maximize
$V_{M,\pi}$ (the lower bound $V_{\downarrow\mathrm{opt}}$ corresponds to how badly this optimistic strategy
for the first player can misfire if the second player betrays the first player and
selects an MDP to minimize $V_{M,\pi}$). The lower bounds $V_{\downarrow\mathrm{pes}}$ of the pessimistically
optimal value function represent the best the first player can do under the assumption
that the second player is an adversary, trying to minimize the resulting value
function.

We conclude this section by stating a Bellman equation theorem for the optimal
interval value functions just defined. The equations below form the basis for
our iterative algorithm for computing the optimal interval value functions for a
BMDP. We start by stating two definitions that are useful in proving the Bellman
theorem as well as in later sections. It is useful to have notation to denote the set of
actions that maximize the upper bound at each state. For a given value function $V$,
we write $\rho_V$ for the function from states to sets of actions such that for each state
$p$,

$$\rho_V(p) = \operatorname*{argmax}_{a \in A}\ \max_{M \in M_\updownarrow} VI_{M,a}(V)(p). \tag{23}$$

Likewise, for the pessimistic case, we define $\sigma_V$ for the function from states to
sets of actions giving the actions that maximize the lower bound. For each state $p$,
$\sigma_V(p)$ is given by

$$\sigma_V(p) = \operatorname*{argmax}_{a \in A}\ \min_{M \in M_\updownarrow} VI_{M,a}(V)(p). \tag{24}$$

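5. Value functions are ranked by

Theorem 9: For any BMDP $M_{\updownarrow}$, the following Bellman-like equations hold at
every state $p$,

$V_{\updownarrow opt}(p) = \max_{a \in A,\, \leq_{opt}} \left[ \min_{M \in M_{\updownarrow}} VI_{M,a}(V_{\downarrow opt})(p),\ \max_{M \in M_{\updownarrow}} VI_{M,a}(V_{\uparrow opt})(p) \right]$  (25)

and

$V_{\updownarrow pes}(p) = \max_{a \in A,\, \leq_{pes}} \left[ \min_{M \in M_{\updownarrow}} VI_{M,a}(V_{\downarrow pes})(p),\ \max_{M \in M_{\updownarrow}} VI_{M,a}(V_{\uparrow pes})(p) \right]$.  (26)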

Proof:  See  Appendix. 

5.  Estimating  Interval  Value  Functions 

In this section, we describe dynamic programming algorithms that operate on
bounded-parameter MDPs. We first define the interval equivalent of policy evaluation,
$IVI_{\updownarrow\pi}$, which computes $V_{\updownarrow\pi}$, and then define the variants $IVI_{\updownarrow opt}$ and $IVI_{\updownarrow pes}$,
which compute the optimistic and pessimistic optimal value functions.

5.1  Interval  Policy  Evaluation 

In direct analogy to the exact MDP definition of $VI_{\pi}$ in Section 3, we define a
function $IVI_{\updownarrow\pi}$ (for interval value iteration) which maps interval value functions to
other interval value functions. We prove that iterating $IVI_{\updownarrow\pi}$ on any initial interval
value function produces a sequence of interval value functions that converges to
$V_{\updownarrow\pi}$ in a polynomial number of steps, given a fixed discount factor $\gamma$.

$IVI_{\updownarrow\pi}(V_{\updownarrow})$ is an interval value function, defined for each state $p$ as follows:

$IVI_{\updownarrow\pi}(V_{\updownarrow})(p) = \left[ \min_{M \in M_{\updownarrow}} VI_{M,\pi}(V_{\downarrow})(p),\ \max_{M \in M_{\updownarrow}} VI_{M,\pi}(V_{\uparrow})(p) \right]$  (27)

We define $IVI_{\downarrow\pi}$ and $IVI_{\uparrow\pi}$ to be the corresponding mappings from value functions
to value functions (note that for input $V_{\updownarrow}$, $IVI_{\downarrow\pi}$ does not depend on $V_{\uparrow}$ and so can
be viewed as a function from $\mathcal{V}$ to $\mathcal{V}$; likewise for $IVI_{\uparrow\pi}$ and $V_{\downarrow}$).

The algorithm to compute $IVI_{\updownarrow\pi}$ is very similar to the standard MDP computation
of $VI$, except that we must now be able to select an MDP $M$ from the family
$M_{\updownarrow}$ that minimizes (maximizes) the value attained. We select such an MDP by
selecting a transition probability function $F$ within the bounds specified by the $F_{\updownarrow}$
component of $M_{\updownarrow}$ to minimize (maximize) the value; each possible way of
selecting $F$ corresponds to one MDP in $M_{\updownarrow}$. We can select the values of $F_{pq}(a)$
independently for each $a$ and $p$, but the values selected for different states $q$ (for
fixed $a$ and $p$) interact: they must sum up to one. We now show how to determine,
for fixed $a$ and $p$, the value of $F_{pq}(a)$ for each state $q$ so as to minimize (maximize)
the expression $\sum_{q \in Q} (F_{pq}(a) V(q))$. This step constitutes the heart of the
$IVI_{\updownarrow\pi}$ algorithm and the only significant way the algorithm differs from standard
policy evaluation by successive approximation by iterating $VI_{\pi}$.

Figure 6: An illustration of the basic dynamic programming step in
computing an approximate value function for a fixed policy and bounded-
parameter MDP. $V_{\uparrow\pi}$ gives the upper bounds of the current interval estimates
of $V_{\pi}$. The lighter shaded portions of each arc represent the required lower
bound transition probability and the darker shaded portions represent the
fraction of the remaining transition probability to the upper bound assigned to
the arc by $F$.

To compute the lower bounds, the idea is to sort the possible destination
states $q$ into increasing order according to their $V_{\downarrow}$ value, and then choose the
transition probabilities within the intervals specified by $F_{\updownarrow}$ so as to send as much
probability mass as possible to the states early in the ordering (upper bounds are
computed similarly, but sorting the states into decreasing order by their $V_{\uparrow}$ value).
Let $O = q_1, q_2, \ldots, q_k$ be such an ordering of $Q$, so that for all $i$ and $j$, if
$1 \leq i \leq j \leq k$ then $V_{\downarrow}(q_i) \leq V_{\downarrow}(q_j)$ (increasing order). We can then show that the
order-maximizing MDP $M_O$ is the MDP that minimizes the desired expression
$\sum_{q \in Q} (F_{pq}(a) V(q))$. The order-maximizing MDP for the decreasing order based
on $V_{\uparrow}$ will maximize the same expression to generate the upper bound in
Equation 27.
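
The greedy selection just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `extreme_distribution` and `expected_value` are hypothetical, the bounds for one fixed state-action pair are passed as plain lists, and the bounds are assumed consistent (the lower bounds sum to at most one and the upper bounds to at least one).

```python
from typing import List, Sequence


def extreme_distribution(lower: Sequence[float], upper: Sequence[float],
                         values: Sequence[float], maximize: bool) -> List[float]:
    """Choose probabilities within [lower, upper] that sum to one and
    minimize (or maximize) the expected value of `values`."""
    n = len(values)
    # Preferred destinations come first: lowest value when minimizing,
    # highest value when maximizing.
    order = sorted(range(n), key=lambda q: values[q], reverse=maximize)
    probs = list(lower)            # every destination receives its lower bound
    remaining = 1.0 - sum(lower)   # probability mass still to be distributed
    for q in order:
        give = min(upper[q] - lower[q], remaining)
        probs[q] += give
        remaining -= give
    return probs


def expected_value(probs: Sequence[float], values: Sequence[float]) -> float:
    """Expected value of `values` under the distribution `probs`."""
    return sum(p * v for p, v in zip(probs, values))
```

Calling the function once with `maximize=False` on the lower-bound values and once with `maximize=True` on the upper-bound values gives the two distributions needed for one state in Equation 27.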

Figure 6 illustrates the basic iterative step in the above algorithm, for the upper
bound, i.e., maximizing, case. The states $q_i$ are ordered according to the value
estimates in $V_{\uparrow}$. The transitions from a state $p$ to states $q_i$ are defined by the function
$F$ such that each transition is equal to its lower bound plus some fraction of the
leftover probability mass. For a more precise account of the algorithm, please refer
to Figure 7 for a pseudocode description of the computation of $IVI_{\updownarrow\pi}(V_{\updownarrow})$.

Techniques  similar  to  those  in  Section  3  can  be  used  to  prove  that  iterating 
Proof: See Appendix.


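Theorem 11: For any policy $\pi$, $V_{\downarrow\pi}$ is a fixed-point of $IVI_{\downarrow\pi}$ and $V_{\uparrow\pi}$ of
$IVI_{\uparrow\pi}$, and therefore $V_{\updownarrow\pi}$ is a fixed-point of $IVI_{\updownarrow\pi}$.

These theorems, together with Theorem 1 (the Banach fixed-point theorem), imply
that iterating $IVI_{\updownarrow\pi}$ on any initial interval value function converges to $V_{\updownarrow\pi}$,
regardless of the starting point.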
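
As a rough sketch of interval policy evaluation, the Python below iterates the interval update for a fixed policy until both bounds stop changing. It rests on assumptions that are not in the source: the names `extreme_expectation` and `interval_policy_evaluation` are hypothetical, rewards depend only on the state, `f_lo[p][q]` and `f_hi[p][q]` already hold the bounds for the action the policy takes in `p`, and a simple tolerance test replaces the precision-based stopping condition discussed in the text.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def extreme_expectation(lower: Sequence[float], upper: Sequence[float],
                        values: Sequence[float], maximize: bool) -> float:
    """Min (or max) of sum_q F[q] * values[q] over lower <= F <= upper, sum(F) = 1."""
    order = sorted(range(len(values)), key=lambda q: values[q], reverse=maximize)
    remaining = 1.0 - sum(lower)
    total = sum(l * v for l, v in zip(lower, values))
    for q in order:
        give = min(upper[q] - lower[q], remaining)
        total += give * values[q]
        remaining -= give
    return total


def interval_policy_evaluation(f_lo: Matrix, f_hi: Matrix,
                               reward: Sequence[float], gamma: float,
                               tol: float = 1e-10,
                               max_iters: int = 100_000
                               ) -> Tuple[List[float], List[float]]:
    """Iterate the interval update for a fixed policy; return (lower, upper) bounds."""
    n = len(reward)
    v_lo = [0.0] * n
    v_hi = [0.0] * n
    for _ in range(max_iters):
        # Lower bounds use the minimizing distribution, upper bounds the maximizing one.
        new_lo = [reward[p] + gamma * extreme_expectation(f_lo[p], f_hi[p], v_lo, False)
                  for p in range(n)]
        new_hi = [reward[p] + gamma * extreme_expectation(f_lo[p], f_hi[p], v_hi, True)
                  for p in range(n)]
        change = max(abs(a - b) for a, b in zip(new_lo + new_hi, v_lo + v_hi))
        v_lo, v_hi = new_lo, new_hi
        if change < tol:
            break
    return v_lo, v_hi
```

When every interval is degenerate (`f_lo == f_hi`), the two bounds coincide and the routine reduces to ordinary policy evaluation by successive approximation.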

Theorem 12: For fixed $\gamma < 1$, interval policy evaluation converges to the
desired interval value function in a number of steps polynomial in the number
of states, the number of actions, and the number of bits used to represent the
BMDP parameters.

Proof: (sketch) We provide only the key ideas behind this proof.

(a) By Theorem 10, $IVI_{\updownarrow\pi}$ is a contraction by $\gamma$ on both the upper and lower
bound value functions, and thus the successive estimates of $V_{\updownarrow\pi}$ produced
converge exponentially to the unique fixed-point.

(b) By Theorem 11, the unique fixed-point is the desired value function.

(c) The upper bound and lower bound value functions making up the true
$V_{\updownarrow\pi}$ are the value functions of $\pi$ in particular MDPs ($\pi$-maximizing and
$\pi$-minimizing MDPs, respectively) in $M_{\updownarrow}$.

(d)  The  parameters  for  the  MDPs  in  can  be  specified  with  a  number  of 
bits  polynomial  in  the  number  of  bits  used  to  specify  the  BMDP  parame¬ 
ters. 

(e) The value function for a policy in an MDP can be written as the solution
to a linear program. The precision of any such solution can be bounded in
terms of the number of bits used to specify the linear program. This precision
bound allows the definition of a stopping condition for $IVI_{\updownarrow\pi}$ when
adequate precision is obtained.

□ (Theorem 12).
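
The exponential convergence in step (a) can be made concrete with the standard contraction bound. The sketch below makes assumptions that are not in the source: iteration starts from $V_0 = 0$, rewards are bounded in absolute value by $R_{\max}$, and the symbols $V_k$, $V^*$, $\epsilon$, and $b$ are introduced here only for illustration, with $V^*$ standing for either bound of the fixed point.

```latex
\|V_k - V^*\|_\infty \;\le\; \gamma^k \,\|V_0 - V^*\|_\infty \;\le\; \gamma^k \,\frac{R_{\max}}{1-\gamma},
\qquad\text{so}\qquad
\|V_k - V^*\|_\infty \le \epsilon
\quad\text{whenever}\quad
k \;\ge\; \frac{\ln\!\big(R_{\max} / (\epsilon\,(1-\gamma))\big)}{\ln(1/\gamma)} .
```

Taking $\epsilon = 2^{-b}$, the required number of iterations grows linearly in the number of bits $b$ for fixed $\gamma$, which is the shape of bound that steps (c) through (e) are meant to supply.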

5.2  Interval  Value  Iteration 

As in the case of altering $VI_{\pi}$ to obtain $VI$, it is straightforward to modify $IVI_{\updownarrow\pi}$
so that it computes optimal policy value intervals by adding a maximization step
over the different action choices in each state. However, unlike standard value
iteration, the quantities being compared in the maximization step are closed real
intervals, so the resulting algorithm varies according to how we choose to compare real
intervals. We define two variations of interval value iteration; other variations
are possible.
$IVI_{\updownarrow opt}(V_{\updownarrow})(p) = \max_{a \in A,\, \leq_{opt}} \left[ \min_{M \in M_{\updownarrow}} VI_{M,a}(V_{\downarrow})(p),\ \max_{M \in M_{\updownarrow}} VI_{M,a}(V_{\uparrow})(p) \right]$  (28)

$IVI_{\updownarrow pes}(V_{\updownarrow})(p) = \max_{a \in A,\, \leq_{pes}} \left[ \min_{M \in M_{\updownarrow}} VI_{M,a}(V_{\downarrow})(p),\ \max_{M \in M_{\updownarrow}} VI_{M,a}(V_{\uparrow})(p) \right]$  (29)

The added maximization step introduces no new difficulties in implementing the
algorithm; for more details we provide pseudocode for $IVI_{\updownarrow opt}$ in Figure 8. We
discuss convergence for $IVI_{\updownarrow opt}$; the convergence results for $IVI_{\updownarrow pes}$ are similar.
We first summarize our approach and then cover the same ground in more detail.

We write $IVI_{\uparrow opt}$ for the upper bound returned by $IVI_{\updownarrow opt}$, and we consider
$IVI_{\uparrow opt}$ a function from $\mathcal{V}$ to $\mathcal{V}$ because $IVI_{\uparrow opt}(V_{\updownarrow})$ depends only on $V_{\uparrow}$ due to the
way $\leq_{opt}$ compares intervals primarily based on their upper bound. $IVI_{\uparrow opt}$ can
easily be shown to be a contraction mapping, and it can be shown that $V_{\uparrow opt}$ is a
fixed point of $IVI_{\uparrow opt}$. It then follows that $IVI_{\uparrow opt}$ converges to $V_{\uparrow opt}$ (and we can
argue as for $IVI_{\updownarrow\pi}$ that this convergence occurs in polynomially many steps for
fixed $\gamma$). The analogous results for $IVI_{\downarrow opt}$ are somewhat more problematic.
Because the action selection is done according to $\leq_{opt}$, which focuses primarily on
the interval upper bounds, $IVI_{\downarrow opt}$ is not properly a mapping from $\mathcal{V}$ to $\mathcal{V}$, as the
action choice for $IVI_{\downarrow opt}(V_{\updownarrow})$ depends on both $V_{\downarrow}$ and $V_{\uparrow}$. In particular, for each
state, the action that maximizes the lower bound is chosen from among the subset
of actions that (equally) maximize the upper bound.
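
The two interval orderings amount to lexicographic comparisons, which a short Python sketch can make explicit. The helper names `opt_key`, `pes_key`, and `best_action` are hypothetical, and intervals are assumed to be `(lower, upper)` tuples; the pessimistic tie-break on the upper bound is assumed by symmetry with the optimistic case described above.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def opt_key(iv: Interval) -> Tuple[float, float]:
    """Optimistic ordering: compare upper bounds first, then lower bounds."""
    return (iv[1], iv[0])


def pes_key(iv: Interval) -> Tuple[float, float]:
    """Pessimistic ordering: compare lower bounds first, then upper bounds."""
    return (iv[0], iv[1])


def best_action(action_values: Dict[str, Interval], optimistic: bool) -> str:
    """Return the action whose value interval is greatest under the chosen ordering."""
    key = opt_key if optimistic else pes_key
    return max(action_values, key=lambda a: key(action_values[a]))
```

For the intervals `{"a": (0, 10), "b": (5, 9), "c": (2, 10)}`, the optimistic ordering prefers `"c"` (it ties with `"a"` on the upper bound and wins on the lower bound), while the pessimistic ordering prefers `"b"`.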

To deal with this complication, we observe that if we fix the upper bound value
function $V_{\uparrow}$, we can view $IVI_{\downarrow opt}$ as a function from $\mathcal{V}$ to $\mathcal{V}$ carrying the lower
bounds of the input value function to the lower bounds of the output. To formalize
this idea, we introduce some new notation. First, given two value functions $V_1$ and
$V_2$ we define the interval value function $[V_1, V_2]$ to be the function from states $p$
to intervals $[V_1(p), V_2(p)]$ (this notation is essentially the inverse of the $\downarrow$ and $\uparrow$
notation which extracts lower and upper bound functions from interval functions).
Using this new notation, we define a family $\{IVI_{\downarrow opt, V}\}$ of functions from $\mathcal{V}$ to
$\mathcal{V}$, indexed by a value function $V$. For each value function $V$, we define
$IVI_{\downarrow opt, V}$ to be the function from $\mathcal{V}$ to $\mathcal{V}$ that maps $V'$ to $IVI_{\downarrow opt}([V', V])$.
(Analogously, we define $IVI_{\uparrow pes, V}$ to map $V'$ to $IVI_{\uparrow pes}([V, V'])$.) We note
that $IVI_{\downarrow opt, V}$ has the following relationship to $IVI_{\downarrow opt}$:

$IVI_{\downarrow opt}(V_{\updownarrow}) = IVI_{\downarrow opt, V_{\uparrow}}(V_{\downarrow})$  (30)

$IVI_{\updownarrow opt}(V_{\updownarrow})$

\\ We assume that $V_{\updownarrow}$ is represented as:
\\   $V_{\downarrow}$ is a vector of n real numbers giving lower-bounds for states $q_1$ to $q_n$
\\   $V_{\uparrow}$ is a vector of n real numbers giving upper-bounds for states $q_1$ to $q_n$

{ Create O, a vector of n states for holding a permutation of the states $q_1$ to $q_n$

  \\ first, compute new lower bounds
  O = sort_increasing_order($q_1, \ldots, q_n$, $<_{lb}$);   \\ $<_{lb}$ compares state lower-bounds
  VI-Update($V_{\downarrow}$, O);

  \\ second, compute new upper bounds
  O = sort_decreasing_order($q_1, \ldots, q_n$, $<_{ub}$);   \\ $<_{ub}$ compares state upper-bounds
  VI-Update($V_{\uparrow}$, O) }

\\ VI-Update(v, o) updates v using the order-maximizing MDP for o
\\ o is a state ordering: a vector of states (a permutation of $q_1, \ldots, q_n$)
\\ v is a value function: a vector of real numbers of length n

VI-Update(v, o)

{ Create $F_a$, a matrix of n by n real numbers for each action a

  \\ the next loop sets each $F_a$ to describe a in the order-maximizing MDP for o
  for each state p and action a {
      used = $\sum_{\text{state } q} F_{\downarrow pq}(a)$;
      remaining = 1 - used;

      \\ distribute remaining probability mass to states earlier in ordering
      for i = 1 to n {    \\ i is used to index into ordering o
          min = $F_{\downarrow p,o(i)}(a)$;
          desired = $F_{\uparrow p,o(i)}(a)$ - min;
          if (desired <= remaining)
              then $F_a(p, o(i))$ = min + desired;
              else $F_a(p, o(i))$ = min + remaining;
          remaining = max(0, remaining - desired) } }

  \\ $F_a$ now describes a in the order-maximizing MDP w/respect to O,
  \\ finally, update v using a value iteration-like update based on $F_a$
  for each state p
      v(p) = $\max_{a \in A} \left[ R(p) + \gamma \sum_{\text{state } q} F_a(p, q)\, v(q) \right]$ }

Figure 8: Pseudocode: an iteration of optimistic interval value iteration ($IVI_{\updownarrow opt}$).

In analyzing $IVI_{\downarrow opt}$, we also use the notation defined in Section 4 for the set of
actions that maximize the upper bound at each state. We restate the relevant definition
here for convenience. For a given value function $V$, we write $\rho_V$ for the function
from states to sets of actions such that for each state $p$,

$\rho_V(p) = \arg\max_{a \in A} \max_{M \in M_{\updownarrow}} VI_{M,a}(V)(p)$  (31)

Likewise, for the pessimistic case, we defined $\sigma_V$ in Section 4.


Given the definition of $\leq_{opt}$, it is straightforward to show the following lemma.

Lemma 4: For any value functions $V$, $V'$ and state $p$,

$IVI_{\downarrow opt, V}(V')(p) = \max_{a \in \rho_V(p)} \min_{M \in M_{\updownarrow}} VI_{M,a}(V')(p)$

$IVI_{\uparrow pes, V}(V')(p) = \max_{a \in \sigma_V(p)} \max_{M \in M_{\updownarrow}} VI_{M,a}(V')(p)$

Proof: By inspection of the definitions of $IVI_{\updownarrow opt}$ and $IVI_{\updownarrow pes}$.

□ (Lemma 4).

We now show that for each $V$, $IVI_{\downarrow opt, V}$ is a contraction mapping relative to the
sup norm, and thus converges to a unique fixed point, as desired. Theorem 9 then
implies that $V_{\downarrow opt}$ is the unique fixed-point found ($V_{\uparrow pes}$ in the case of $IVI_{\uparrow pes, V}$). We
then show that at any point after polynomially many iterations of $IVI_{\updownarrow opt}$, the
resulting interval value function $V_{\updownarrow}$ has upper bounds $V_{\uparrow}$ that have converged to a
fixed point of $IVI_{\uparrow opt}$; thus further iteration of $IVI_{\updownarrow opt}$ is equivalent to iterating
$IVI_{\uparrow opt}$ and $IVI_{\downarrow opt, V_{\uparrow}}$ together in parallel to generate the upper and lower bounds,
respectively. We can also show that for any $V$, polynomially many iterations of
$IVI_{\downarrow opt, V}$ suffice for convergence to a fixed point. Similar results hold for $IVI_{\updownarrow pes}$.
We now give the details of these results.

Theorem 13:

(a) $IVI_{\uparrow opt}$ and $IVI_{\downarrow pes}$ are contraction mappings.

(b) For any value function $V$ and associated action set selection functions
$\rho_V$ and $\sigma_V$, $IVI_{\downarrow opt, V}$ and $IVI_{\uparrow pes, V}$ are contraction mappings.

Proof: See Appendix.

Theorem 14: For fixed $\gamma$, polynomially many iterations of $IVI_{\updownarrow opt}$ can be used
to find $V_{\updownarrow opt}$, and polynomially many iterations of $IVI_{\updownarrow pes}$ can be used to find
$V_{\updownarrow pes}$, with both polynomials defined relative to the problem size including the
number of bits used in specifying the parameters.

Proof: (sketch)

The argument here is exactly as in Theorem 12, relying on Theorems 9 and 13,
except that the iterations must be taken to convergence in two stages. Considering
$IVI_{\updownarrow opt}$, we must first iterate until the upper bound has converged, with the
polynomial-time bound on iterations deriving by a similar argument to the
proof of Theorem 12; then once the upper bounds have converged we must
iterate until the lower bounds have converged, again in polynomially many
iterations by another argument similar to that in the proof of Theorem 12.

More precisely, let $V_{\updownarrow 1}, V_{\updownarrow 2}, \ldots$ be a sequence of interval value functions found
by iterating $IVI_{\updownarrow opt}$, so that for each $i$ greater than or equal to 1 we have $V_{\updownarrow i+1}$
equal to $IVI_{\updownarrow opt}(V_{\updownarrow i})$. Then an argument similar to the proof of Theorem 12
guarantees that for some $j$ polynomial in the size of the problem, $V_{\updownarrow j}$ must have
upper bounds that are equal to the true fixed point upper bound values, up to the
maximum precision of the true fixed point. We then know that truncating the
upper value bounds in $V_{\updownarrow j}$ to that precision (to get an interval value function
$V_{\updownarrow j}'$) gives the true fixed point upper bound values. We can then iterate $IVI_{\updownarrow opt}$
starting on $V_{\updownarrow j}'$ to get another sequence of value functions where the upper
bounds are unchanging and the lower bounds are converging to the correct fixed
point values in the same manner.

A similar argument shows polynomial convergence for $IVI_{\updownarrow pes}$.

□ (Theorem 14).
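
For readers who want to experiment, the following Python sketch applies the optimistic interval update repeatedly until both bounds settle. It is a simplification under stated assumptions rather than the two-stage procedure of the proof: it does not truncate the upper bounds to a fixed precision, it treats upper bounds within a hypothetical tolerance `tie_tol` as tied, rewards depend only on the state, and the names `extreme_expectation` and `ivi_opt` and the layout `f_lo[a][p][q]`, `f_hi[a][p][q]` are invented for this example.

```python
from typing import List, Sequence, Tuple


def extreme_expectation(lower: Sequence[float], upper: Sequence[float],
                        values: Sequence[float], maximize: bool) -> float:
    """Min (or max) of sum_q F[q] * values[q] over lower <= F <= upper, sum(F) = 1."""
    order = sorted(range(len(values)), key=lambda q: values[q], reverse=maximize)
    remaining = 1.0 - sum(lower)
    total = sum(l * v for l, v in zip(lower, values))
    for q in order:
        give = min(upper[q] - lower[q], remaining)
        total += give * values[q]
        remaining -= give
    return total


def ivi_opt(f_lo, f_hi, reward: Sequence[float], gamma: float,
            tol: float = 1e-10, tie_tol: float = 1e-9,
            max_iters: int = 100_000) -> Tuple[List[float], List[float]]:
    """Repeated optimistic interval value iteration; returns (lower, upper) bounds."""
    n = len(reward)
    v_lo, v_hi = [0.0] * n, [0.0] * n
    for _ in range(max_iters):
        new_lo, new_hi = [], []
        for p in range(n):
            # One candidate (lower, upper) interval per action.
            cands = [(reward[p] + gamma * extreme_expectation(f_lo[a][p], f_hi[a][p], v_lo, False),
                      reward[p] + gamma * extreme_expectation(f_lo[a][p], f_hi[a][p], v_hi, True))
                     for a in range(len(f_lo))]
            best_hi = max(hi for _, hi in cands)
            # Among actions tied for the best upper bound, keep the best lower bound.
            best_lo = max(lo for lo, hi in cands if hi >= best_hi - tie_tol)
            new_lo.append(best_lo)
            new_hi.append(best_hi)
        change = max(abs(a - b) for a, b in zip(new_lo + new_hi, v_lo + v_hi))
        v_lo, v_hi = new_lo, new_hi
        if change < tol:
            break
    return v_lo, v_hi
```

The upper bounds here depend only on the previous upper bounds, while the lower bounds depend on both, which mirrors the asymmetry that motivates the two-stage argument above.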

6.  Policy  Selection 

In this section, we consider the problem of selecting a policy based on the value
bounds computed by our IVI algorithms. This section is not intended as an additional
research contribution as much as a discussion of issues that arise in solving
BMDP problems and of alternative approaches to policy selection (other than the
optimistic and pessimistic approaches we take here). We begin by reemphasizing
some ideas introduced earlier regarding the selection of policies. To begin with, it
is important that we are clear on the status of the bounds in a bounded-parameter
MDP. A bounded-parameter MDP specifies upper and lower bounds on individual
parameters; the assumption is that we have no additional information regarding
individual exact MDPs whose parameters fall within those bounds. In particular, we
have no prior over the exact MDPs in the family of MDPs defined by a bounded-
parameter MDP. We note again that in many applications it is possible to compute
prior probabilities over these parameters, but that these computations are prohibitively
expensive in our motivating application (solving large state-space problems
by approximate state-space aggregation).

Despite the fact that a BMDP does not specify which particular MDP we are
facing, we may have to choose a policy. In such a situation, it is natural to consider
that the actual MDP, i.e., the one in which we ultimately have to carry out the policy,
is decided by some outside process. That process might choose so as to help or
hinder us, or it might be entirely indifferent. To maximize potential performance,
we might assume that the outside process cooperates by choosing the MDP in
order to help us; we can then select the policy that performs as well as possible
given that assumption. In contrast, we might minimize the risk of performing
poorly by thinking in adversarial terms: we can select the policy that performs as
well as possible under the assumption that an adversary chooses the MDP so that
we perform as poorly as possible (in each case we assume that the MDP is chosen
from the BMDP family of MDPs after the policy has been selected in order to
minimize/maximize the value of that policy).

These choices correspond to optimistic and pessimistic optimal policies as
defined above. We have discussed in the last section how to compute interval value
functions for such policies; such value functions can then be used in a
straightforward manner to extract policies that achieve those values.

We  note  that  it  may  seem  unnatural  to  be  required  to  take  an  optimistic  or  a 
pessimistic  approach  in  order  to  select  a  policy  —  certainly  this  is  not  analogous  to 
policy  selection  for  standard  MDPs.  This  requirement  grows  out  of  our  model 
assumption  that  we  have  no  prior  probabilities  on  the  model  parameters,  and  we 
have argued that this assumption is in fact natural, at the very least in our motivating
domain of approximate state-space aggregation. The same assumption is also natural
in performing sensitivity analysis, as described in the next section. We also note
that  there  is  precedent  in  the  related  MDP  literature  for  considering  optimistic  and 
pessimistic  approaches  to  policy  selection  in  the  face  of  uncertainty  about  the 
model;  see,  for  example,  the  work  of  Satia  and  Lave  in  [15]. 

Alternative approaches to selecting a policy are possible, but some approaches
that seem natural at first run into trouble. For instance, we might consider placing a
uniform prior probability on each model parameter within its specified interval.
Unfortunately, the model parameters cannot in general be selected independently
(because they must together represent a well-formed probability distribution after
selection), and there may not even be any joint prior distribution over the parameters
whose marginal for each parameter is the uniform distribution over the provided
interval. Therefore, the uniform distribution over the provided intervals does not
enjoy any distinguished status: it may not even correspond to a well-formed prior
over the underlying MDPs in the BMDP family.
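
A small worked case, which is not taken from the source, shows how such a joint prior can fail to exist. Assume a single state-action pair with three possible successor states, each transition probability bounded only by the interval $[0, 1]$; the symbols $p_1, p_2, p_3$ are introduced here for illustration.

```latex
p_1 + p_2 + p_3 = 1
\;\Longrightarrow\;
E[p_1] + E[p_2] + E[p_3] = 1,
\qquad\text{but}\qquad
p_i \sim \mathrm{Uniform}[0,1] \text{ for each } i
\;\Longrightarrow\;
E[p_1] + E[p_2] + E[p_3] = \tfrac{3}{2}.
```

Since the two sums disagree, no joint distribution over well-formed transition distributions can have all three marginals uniform on $[0, 1]$, even though every value in each interval is attainable on its own.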

There are other well-formed choices corresponding to other means of totally
ordering real closed intervals (other than $\leq_{opt}$ and $\leq_{pes}$). For instance, we might
order intervals by their midpoints, asserting a preference for states where the highest
and lowest value possible in the underlying MDP family have a high mean. It is
not clear when this choice might be preferred; however, we believe our methods can
be naturally adapted to compute optimal policy values for other interval orderings,
if desired.

A natural goal would be to find a policy whose average performance over all
MDPs in the family is as good as or better than the average performance of any
other policy. This notion of average is potentially problematic, however, as it
essentially assumes a uniform prior over exact MDPs and, as stated earlier, the
bounds do not imply any particular prior. Moreover, it is not at all clear how to find
such a policy; our methods do not appear to generalize in this direction. As
noted just above, this goal does not correspond to assuming a uniform prior over
the model parameters, but rather a more complex joint distribution over the parameters.
Also, this average case solution would not in general provide useful information
in our motivating application of state-space aggregation: we would have no
guarantee that the uniform prior over MDP models consistent with the BMDP had
any useful correlation with the original large MDP that aggregated to the BMDP.
In contrast, as discussed below, the optimistic and pessimistic bounds we compute
apply directly to any MDP when the BMDP analyzed is formed by state-space
aggregation of that MDP. Nevertheless, the question of how to compute the optimal
average case policy for a BMDP appears to be a useful direction for future
research.

7.  Prototype  Implementation  Results  and  Potential  Applications 

In  this  section  we  discuss  our  intended  applications  for  the  new  BMDP  algorithms, 
and  present  empirical  results  from  a  prototype  implementation  of  the  algorithms 
for  use  in  state-space  aggregation.  We  note  that  no  particular  difficulties  were 
encountered  in  implementing  the  new  BMDP  algorithms  —  implementation  is 
more  demanding  than  that  of  standard  MDP  algorithms,  but  only  by  the  addition  of 
a  sorting  algorithm. 

Sensitivity  Analysis.  One  way  in  which  bounded-parameter  MDPs  might  be  useful 
in  planning  under  uncertainty  might  begin  with  a  particular  exact  MDP  (say,  the 
MDP  with  parameters  whose  values  reflect  the  best  guess  according  to  a  given 
domain  expert).  If  we  were  to  compute  the  optimal  policy  for  this  exact  MDP,  we 
might  wonder  about  the  degree  to  which  this  policy  is  sensitive  to  the  numbers 
supplied  by  the  expert. 

To  assess  this  possible  sensitivity  to  the  parameters,  we  might  perturb  the  MDP 
parameters and evaluate the policy with respect to the perturbed MDP. Alternatively,
we could use BMDPs to perform this sort of sensitivity analysis on a whole
family  of  MDPs  by  converting  the  point  estimates  for  the  parameters  to  confidence 
intervals  and  then  computing  bounds  on  the  value  function  for  the  fixed  policy  via 
interval  policy  evaluation. 

Aggregation. Another use of BMDPs involves a different interpretation altogether.
Instead of viewing the states of the bounded-parameter MDP as individual primitive
states, we view each state of the BMDP as representing a set or aggregate of
states of some other, larger MDP. We note that this use provides our original
motivation for developing BMDPs, and therefore it is this use that we give prototype
empirical results for below.

Bounded-parameter Markov Decision Processes, June 16, 2000

In the state-aggregate interpretation of a BMDP, states are aggregated together
because they behave approximately the same with respect to possible state
transitions. A little more precisely, suppose that the set of states of the BMDP $M_{\updownarrow}$
corresponds to the set of blocks $\{B_1, \ldots, B_n\}$ such that the $\{B_i\}$ constitutes the
partition of another MDP with a much larger state space.

Now we interpret the bounds as follows: for any two blocks $B_i$ and $B_j$, let
$F_{\updownarrow B_i B_j}(a)$ represent the interval value for the transition from $B_i$ to $B_j$ on action $a$
defined as follows:

$F_{\updownarrow B_i B_j}(a) = \left[ \min_{p \in B_i} \sum_{q \in B_j} F_{pq}(a),\ \max_{p \in B_i} \sum_{q \in B_j} F_{pq}(a) \right]$  (33)

Intuitively, this means that all states in a block behave approximately the same
(assuming the lower and upper bounds are close to each other) in terms of
transitions to other blocks even though they may differ widely with regard to transitions
to individual states.
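
Equation 33 can be illustrated with a short Python sketch for a single action. The function name `block_transition_intervals` is hypothetical, the exact MDP is assumed to be given as an explicit matrix `f[p][q]` of transition probabilities, and the partition is assumed to be a list of lists of state indices; none of this reflects the implicit representations used in [10].

```python
from typing import List, Sequence, Tuple


def block_transition_intervals(f: Sequence[Sequence[float]],
                               blocks: Sequence[Sequence[int]]
                               ) -> List[List[Tuple[float, float]]]:
    """For each pair of blocks (B_i, B_j), return the interval
    [min, max] over p in B_i of the total probability of moving into B_j."""
    result = []
    for b_i in blocks:
        row = []
        for b_j in blocks:
            # Probability mass each state of B_i sends into B_j.
            mass = [sum(f[p][q] for q in b_j) for p in b_i]
            row.append((min(mass), max(mass)))
        result.append(row)
    return result
```

With three states, `f = [[0.2, 0.3, 0.5], [0.1, 0.1, 0.8], [0.0, 0.0, 1.0]]` and blocks `[[0, 1], [2]]`, the first block moves to the second with probability between 0.5 and 0.8, so the two aggregated states differ in how they reach individual states yet share one interval-valued transition.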

In Dean et al. [10] we discuss methods for using an implicit representation of
an exact MDP with a large number of states to construct an explicit BMDP with a
possibly much smaller number of states based on an aggregation method. We then
show that policies computed for this BMDP can be extended to the original large
implicitly-described MDP. Note that the original implicit MDP is not even a member
of the family of MDPs for the reduced BMDP (it has a different state space, for
instance). Nevertheless, it is a theorem that the policies and value bounds of the
BMDP can be soundly applied in the original MDP (using the aggregation mapping
to connect the state spaces). In particular, the lower interval bounds computed
on a given state block by $IVI_{\downarrow pes}$ give lower bounds on the optimal value for states
in that block in the original MDP; likewise, the upper interval bounds computed by
$IVI_{\uparrow opt}$ give upper bounds on the optimal value in the original MDP.

Empirical Results. We constructed a prototype implementation of our BMDP
algorithms, interval value iteration and interval policy evaluation. We then used
this implementation in conjunction with implementations of our previously presented
approximate state-space aggregation algorithms [10] in order to compute
lower and upper bounds on the values of individual states in large MDP problems.

The MDP problems used were derived by partially modelling air campaign
planning problems using implicit MDP representations. These problems involve
selecting tasks for a variety of military aircraft over time in order to maximize the
utility of their actions, and require modeling many aspects of the aircraft capabilities,
resources, crew, and tasks. Modeling the full problem as an MDP is still out of
reach; the MDP models used in these experiments were constructed by representing
the problem at varying degrees of (extremely coarse) abstraction so that the
resulting problem would be within reach of our prototype implementation.

Table 15: Model Size after Approximate Minimization

# State Vars   # States   ε = 0   ε = 0.01   ε = 0.1   ε = 0.3   ε = [illegible]   ε = 0.8
     9              512     114       114        72        24          11              8
    10             1024     131       122        85        55          21             21
    13             8192     347       347       272       148          66             63
    14            16384       -         -       442       153          67             63
    15            32768       -         -       520       152          88             69
IVI Inaccuracy:              0%      0.2%       10%       40%         58%            62%

We show in Table 15 the original problem state-space size, the state-space size
of the BMDP that results from our aggregation algorithm, and the quality of the
resulting state-value bounds for several different sized MDP problems. Each row
in the table corresponds to a specific explicit MDP that we solved (approximately
and/or exactly) using state-space aggregation. We note that one parameter (ε) of
our aggregation method is the degree of approximation tolerated in transition
probability; this corresponds to the interval width in the BMDP parameter intervals.
As this parameter is given larger and larger values across the columns of the table,
the aggregate BMDP model has fewer and fewer states; in return, the value
bounds obtained are less and less tight. The quality of the resulting state-value
bounds is given by showing "IVI Inaccuracy": this percentage is the average
width of the value intervals computed as a percentage of the difference between
the lowest possible state value and the highest possible state value (these are
defined by assuming a repeated occurrence of the lowest/highest reward available
for an infinite time period and computing the total discounted reward obtained).
Our prototype aggregation code was incapable of handling the exact and near-
exact analysis of the largest models tried, and those entries in the table are
therefore missing (shown as "-").
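
The "IVI Inaccuracy" measure can be written down directly from its description. The Python below is a sketch under assumptions of our own: the function name `ivi_inaccuracy` is hypothetical, the average is an unweighted mean over the states supplied, and the extreme state values are taken to be the discounted sums $r_{min}/(1-\gamma)$ and $r_{max}/(1-\gamma)$.

```python
from typing import Sequence, Tuple


def ivi_inaccuracy(intervals: Sequence[Tuple[float, float]],
                   r_min: float, r_max: float, gamma: float) -> float:
    """Average interval width as a percentage of the widest possible value range."""
    widths = [hi - lo for lo, hi in intervals]
    mean_width = sum(widths) / len(widths)
    # Receiving r_max (or r_min) forever yields r / (1 - gamma) in discounted total.
    value_range = (r_max - r_min) / (1.0 - gamma)
    return 100.0 * mean_width / value_range
```

For two states with value intervals `(0, 1)` and `(2, 4)`, rewards between 0 and 1, and a discount factor of 0.9, the value range is 10 and the mean width is 1.5, giving an inaccuracy of about 15%.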

We  note  that  IVI  inaccuracies  of  much  greater  than  25%  may  not  represent 
very useful bounds on state value (we have not yet conducted experiments to evaluate
this question). For this reason, the last three columns of the table are shown
primarily  for  completeness  and  to  satisfy  curiosity.  However,  an  inaccuracy  of 
10%  can  be  expected  to  yield  useful  information  in  selecting  between  different 
control  actions  —  we  can  think  of  this  level  of  inaccuracy  as  allowing  us  to  rate 
each  state  on  a  scale  of  one  to  ten  as  to  how  good  its  value  is.  Such  ratings  should 
be  very  useful  in  designing  control  policies. 

We  note  that  our  prototype  code  is  not  optimized  in  its  handling  of  either  space 
or  time.  Similar  prototype  code  for  explicit  MDP  problems  can  handle  no  more 
than  a  few  hundred  states.  Production  versions  of  explicit  MDP  code  today  can 
handle  as  many  as  a  million  or  so  states.  Our  aggregation  and  BMDP  algorithms, 
even  in  this  unoptimized  form,  are  able  to  obtain  nontrivial  bounds  on  state  value 
for  state-space  sizes  involving  thousands  of  states.  We  believe  that  a  production 
version of these algorithms could derive near-optimal policies for MDP planning
problems  involving  hundreds  of  millions  of  states. 


8.  Related  Work  and  Conclusions 

Our definition for bounded-parameter MDPs is related to a number of other ideas
appearing in the literature on Markov decision processes; in the following, we
mention just a few of the closest such ideas. First, BMDPs specialize the MDPs
with imprecisely known parameters (MDPIPs) described and analyzed in the
operations research literature by White and Eldeib [17], [18], and Satia and Lave [15].
The more general MDPIPs described in these papers require more general and
expensive algorithms for solution. For example, [17] allows an arbitrary linear
program to define the bounds on the transition probabilities (and allows no
imprecision in the reward parameters); as a result, the solution technique presented
appeals to linear programming at each iteration of the solution algorithm rather
than exploiting the specific structure available in a BMDP as we do here. [15]
mentions the restriction to BMDPs but gives no special algorithms to exploit this
restriction. Their general MDPIP algorithm is very different from our algorithm
and involves two nested phases of policy iteration: the outer phase selects a
traditional policy and the inner phase selects a "policy" for "nature", i.e., a
choice of the transition parameters to minimize or maximize value (depending on
whether optimistic or pessimistic assumptions prevail). Our work, while originally
developed independently of the MDPIP literature, follows similar lines to [15] in
defining optimistic and pessimistic optimal policies. In summary, when
uncertainty about MDP parameters is such that a BMDP model is appropriate, the
MDPIP literature does not provide an approach that exploits the restricted structure
to achieve an efficient method (we note that appealing to linear programming at each
iteration can be very expensive).

Shapley  [16]  introduced  the  notion  of  stochastic  games  to  describe  two-person 
games in which the transition probabilities are controlled by the two players.
MDPIPs,  and  therefore  BMDPs,  are  a  special  case  of  alternating  stochastic  games 
in  which  the  first  player  is  the  decision-making  agent  and  the  second  player,  often 
considered  as  either  an  adversary  or  advocate,  makes  its  move  by  choosing  from 
the  set  of  possible  MDPs  consistent  with  having  seen  the  agent’s  move. 

Bertsekas  and  Castanon  [3]  use  the  notion  of  aggregated  Markov  chains  and 
consider  grouping  together  states  with  approximately  the  same  residuals.  Methods 
for  bounding  value  functions  are  frequently  used  in  approximate  algorithms  for 
solving  MDPs;  Lovejoy  [13]  describes  their  use  in  solving  partially  observable 
MDPs. Puterman [14] provides an excellent introduction to Markov decision
processes and techniques involving bounding value functions.

Boutilier,  Dean  and  Hanks  [5]  provide  a  careful  treatment  of  MDP-related 
methods  demonstrating  how  they  provide  a  unifying  framework  for  modeling  a 
wide range of problems in AI involving planning under uncertainty. This paper
also describes such related issues as state space aggregation, decomposition and
abstraction as these ideas pertain to work in AI. We encourage the reader
unfamiliar with the connection between classical planning methods in AI and Markov
decision processes to refer to this paper.

Boutilier and Dearden [6] and Boutilier et al. [8] describe methods for solving
implicitly described MDPs using dynamic aggregation; in their methods the
state space aggregates vary over the iterations of the dynamic programming
algorithm. This work can be viewed as using a compact representation of both policies
and  value  functions  in  terms  of  state  aggregates  to  perform  the  familiar  dynamic 
programming  algorithms.  Dean  and  Givan  [9]  reinterpret  this  work  in  terms  of 
computing  explicitly  described  MDPs  with  aggregate  states  corresponding  to  the 
aggregates  that  the  above  compactly  represented  value  functions  use  when  they 
have  converged.  Dean,  Givan,  and  Leach  [10]  discuss  relaxing  these  aggregation 
techniques  to  construct  approximate  aggregations  —  it  is  from  this  work  that  the 
notion  of  BMDP  emerged  in  order  to  represent  the  resulting  aggregate  models. 

Bounded-parameter  MDPs  allow  us  to  represent  uncertainty  about  or  variation 
in  the  parameters  of  a  Markov  decision  process.  Interval  value  functions  capture 
the  resulting  variation  in  policy  values.  In  this  paper,  we  have  defined  both 
bounded-parameter  MDP  and  interval  value  function,  and  given  algorithms  for 
computing  interval  value  functions,  and  selecting  and  evaluating  policies. 


11. Appendix: Proofs Omitted Above for Readability

Lemma 1: For any policy $\pi \in \Pi$, MDP $M \in M_\updownarrow$, and value function $v \in V$,

(a) there are MDPs $M_1 \in X_{M_\updownarrow}$ and $M_2 \in X_{M_\updownarrow}$ such that

$$V_{M_1,\pi} \le_{dom} V_{M,\pi} \le_{dom} V_{M_2,\pi}. \tag{34}$$

(b) Also, there are MDPs $M_3 \in X_{M_\updownarrow}$ and $M_4 \in X_{M_\updownarrow}$ such that

$$VI_{M_3,\pi}(v) \le_{dom} VI_{M,\pi}(v) \le_{dom} VI_{M_4,\pi}(v). \tag{35}$$

Proof: To show the existence of $M_1$, let $O = q_1, \ldots, q_k$ be an ordering on
states such that for all $i$ and $j$, if $1 \le i \le j \le k$ then $V_{M,\pi}(q_i) \le V_{M,\pi}(q_j)$
(increasing order). Note that ties in state values permit different orderings; for
the proof, it is sufficient to choose one ordering arbitrarily. Consider $M_O \in X_{M_\updownarrow}$,
the order-maximizing MDP of $O$. $M_O$ is constructed so as to send as much
probability mass as possible to states earlier in the ordering $O$, i.e. to those
states $q$ with lower value $V_{M,\pi}(q)$. It follows that for any state $p$,

$$\sum_{q \in Q} F^{M_O}_{pq}(\pi(p))\, V_{M,\pi}(q) \le \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, V_{M,\pi}(q). \tag{36}$$

Thus, for any state $p$,

$$V_{M,\pi}(p) = R(p) + \gamma \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, V_{M,\pi}(q) \tag{37}$$

$${} \ge R(p) + \gamma \sum_{q \in Q} F^{M_O}_{pq}(\pi(p))\, V_{M,\pi}(q) \tag{38}$$

$${} = VI_{M_O,\pi}(V_{M,\pi})(p). \tag{39}$$

By Theorem 6, these lines imply $V_{M_O,\pi} \le_{dom} V_{M,\pi}$, as desired.

The existence of $M_2$ can be shown in the same manner, except that $O$ is chosen to order
the states by decreasing value. Thus $M_O$ is constructed so that

$$\sum_{q \in Q} F^{M_O}_{pq}(\pi(p))\, V_{M,\pi}(q) \ge \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, V_{M,\pi}(q). \tag{40}$$

Part (b) is shown in the same manner as part (a) except that we replace each
occurrence of $V_{M,\pi}(p)$ with $VI_{M,\pi}(v)(p)$ and each occurrence of
$V_{M,\pi}(q)$ with $v(q)$.

□ (Lemma 1)
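
The order-maximizing construction used in this proof can be sketched as follows. This is a minimal illustration, not the prototype implementation: the function name `order_max_row` is hypothetical, the row of a BMDP is assumed to be given as per-successor lower and upper probability bounds `lo` and `hi`, and the intervals are assumed to be well formed (sum of lower bounds at most 1, sum of upper bounds at least 1).

```python
def order_max_row(lo, hi, order):
    """Pick a distribution within [lo, hi] that sends as much probability
    mass as possible to states appearing early in `order`."""
    row = list(lo)                 # every successor gets at least its lower bound
    slack = 1.0 - sum(lo)          # mass still to be distributed
    for q in order:
        give = min(hi[q] - lo[q], slack)
        row[q] += give
        slack -= give
    return row


def expected(row, v):
    return sum(f * x for f, x in zip(row, v))


lo, hi = [0.25, 0.125, 0.125], [0.75, 0.5, 0.5]
v = [3.0, 1.0, 2.0]
increasing = sorted(range(3), key=lambda q: v[q])                 # low values first
decreasing = sorted(range(3), key=lambda q: v[q], reverse=True)   # high values first
row_min = order_max_row(lo, hi, increasing)   # minimizes the expected value of v
row_max = order_max_row(lo, hi, decreasing)   # maximizes the expected value of v
```

With the increasing ordering the resulting row plays the role of $M_1$ in Equation 36; with the decreasing ordering it plays the role of $M_2$ in Equation 40.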


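Lemma 2: Let $\pi$ be a policy in $\Pi$ and $M_1$, $M_2$ be MDPs in $M_\updownarrow$.

(a) For $M_3 = M_1 \oplus^{\pi}_{\max} M_2$,

$$V_{M_3,\pi} \ge_{dom} V_{M_1,\pi} \quad\text{and}\quad V_{M_3,\pi} \ge_{dom} V_{M_2,\pi}, \tag{41}$$

(b) for $M_3 = M_1 \oplus^{\pi}_{\min} M_2$,

$$V_{M_3,\pi} \le_{dom} V_{M_1,\pi} \quad\text{and}\quad V_{M_3,\pi} \le_{dom} V_{M_2,\pi}. \tag{42}$$

Proof: Part (a): We construct a value function $v$ such that $v \ge_{dom} V_{M_1,\pi}$,
$v \ge_{dom} V_{M_2,\pi}$, and $v \le_{dom} V_{M_3,\pi}$, as follows. For each $p \in Q$, let

$$v(p) = \max(V_{M_1,\pi}(p),\ V_{M_2,\pi}(p)). \tag{43}$$

Note that this implies $v \ge_{dom} V_{M_1,\pi}$ and $v \ge_{dom} V_{M_2,\pi}$. We now show using
Theorem 6 that $v \le_{dom} V_{M_3,\pi}$. By Theorem 6 it suffices to prove that
$v \le_{dom} VI_{M_3,\pi}(v)$, which we now do by showing $v(p) \le VI_{M_3,\pi}(v)(p)$ for
arbitrary $p \in Q$.

Case 1. We suppose $V_{M_1,\pi}(p) \ge V_{M_2,\pi}(p)$.

From Equation 43 we then have that $v(p) = V_{M_1,\pi}(p)$. By the definition of
$\oplus^{\pi}_{\max}$, we know $F^{M_3}_{pq}(\pi(p)) = F^{M_1}_{pq}(\pi(p))$ when $V_{M_1,\pi}(p) \ge V_{M_2,\pi}(p)$, as in
this case. This fact, together with the definitions of $VI$, $V_{M_1,\pi}$, $\oplus^{\pi}_{\max}$, and $v$,
allows the following chain of equations to conclude the proof of case 1:

$$v(p) = V_{M_1,\pi}(p)$$

$${} = R(p) + \gamma \sum_{q \in Q} F^{M_1}_{pq}(\pi(p))\, V_{M_1,\pi}(q)$$

$${} \le R(p) + \gamma \sum_{q \in Q} F^{M_1}_{pq}(\pi(p))\, v(q) \tag{44}$$

$${} = R(p) + \gamma \sum_{q \in Q} F^{M_3}_{pq}(\pi(p))\, v(q)$$

$${} = VI_{M_3,\pi}(v)(p).$$

Case 2: Suppose $V_{M_1,\pi}(p) < V_{M_2,\pi}(p)$.

We then have $F^{M_3}_{pq}(\pi(p)) = F^{M_2}_{pq}(\pi(p))$ by the definition of $\oplus^{\pi}_{\max}$, and
$v(p) = V_{M_2,\pi}(p)$ by the definition of $v$, and Equation 44 holds with $M_1$ replaced
by $M_2$, as desired, concluding the proof of part (a).

Part (b): The proof is exactly dual to part (a) by replacing "max" with "min",
$\le$ with $\ge$ (and vice versa), and $<$ with $>$.

□ (Lemma 2).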
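
As a small numerical illustration of part (a), the sketch below combines two exact MDPs state by state, keeping at each state the transition row of whichever MDP gives that state the higher policy value. The setting is deliberately simplified and the names are hypothetical: the policy is fixed, so each MDP is represented only by its transition matrix under that policy, rewards `R` are assumed exact and shared, and policy values are approximated by repeated backups rather than solved exactly.

```python
def evaluate(P, R, gamma, iters=2000):
    """Approximate the fixed point V = R + gamma * P V by repeated backups."""
    n = len(R)
    v = [0.0] * n
    for _ in range(iters):
        v = [R[p] + gamma * sum(P[p][q] * v[q] for q in range(n)) for p in range(n)]
    return v


def combine_max(P1, P2, R, gamma):
    """Row p of the result comes from the MDP under which state p has higher value."""
    v1, v2 = evaluate(P1, R, gamma), evaluate(P2, R, gamma)
    return [list(P1[p]) if v1[p] >= v2[p] else list(P2[p]) for p in range(len(R))]


R, gamma = [0.0, 1.0], 0.5
P1 = [[0.5, 0.5], [0.5, 0.5]]
P2 = [[0.9, 0.1], [0.1, 0.9]]
P3 = combine_max(P1, P2, R, gamma)
v1, v2, v3 = (evaluate(P, R, gamma) for P in (P1, P2, P3))
```

In this example the combined MDP takes state 0 from the first MDP and state 1 from the second, and its value function dominates both originals, as Equation 41 requires.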


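Lemma 3: Given a BMDP $M_\updownarrow$, and policies $\pi_1, \pi_2 \in \Pi$, $\pi_3 = \pi_1 \oplus_{opt} \pi_2$, and
$\pi_4 = \pi_1 \oplus_{pes} \pi_2$,

(a) $V_{\uparrow\pi_3} \ge_{dom} V_{\uparrow\pi_1}$ and $V_{\uparrow\pi_3} \ge_{dom} V_{\uparrow\pi_2}$,

(b) If $V_{\uparrow\pi_1} = V_{\uparrow\pi_2}$, then $V_{\updownarrow\pi_3} \ge_{opt} V_{\updownarrow\pi_1}$ and $V_{\updownarrow\pi_3} \ge_{opt} V_{\updownarrow\pi_2}$,

(c) $V_{\downarrow\pi_4} \ge_{dom} V_{\downarrow\pi_1}$ and $V_{\downarrow\pi_4} \ge_{dom} V_{\downarrow\pi_2}$,

(d) If $V_{\downarrow\pi_1} = V_{\downarrow\pi_2}$, then $V_{\updownarrow\pi_4} \ge_{pes} V_{\updownarrow\pi_1}$ and $V_{\updownarrow\pi_4} \ge_{pes} V_{\updownarrow\pi_2}$.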

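Proof: Part (a): We prove part (a) of the lemma by constructing a value function
$v$ such that $v \ge_{dom} V_{\uparrow\pi_1}$ and $v \ge_{dom} V_{\uparrow\pi_2}$. We then show that $v \le_{dom} V_{\uparrow\pi_3}$
using Theorem 6. We construct $v$ as follows. Let $v(p) = \max(V_{\uparrow\pi_1}(p),\ V_{\uparrow\pi_2}(p))$
for each $p \in Q$.

This construction implies that $v \ge_{dom} V_{\uparrow\pi_1}$ and $v \ge_{dom} V_{\uparrow\pi_2}$. We now show
$v \le_{dom} V_{\uparrow\pi_3}$ by giving an MDP $M_3$ for which $V_{M_3,\pi_3} \ge_{dom} v$. Using Theorem 6
it suffices to show that $VI_{M_3,\pi_3}(v) \ge_{dom} v$.

Let $M_1 \in M_\updownarrow$ be a $\pi_1$-maximizing MDP, and $M_2 \in M_\updownarrow$ be a $\pi_2$-maximizing
MDP. Note that this implies that $V_{\uparrow\pi_1} = V_{M_1,\pi_1}$ and $V_{\uparrow\pi_2} = V_{M_2,\pi_2}$.

We now construct $M_3 \in M_\updownarrow$ as follows: for each $p, q, a$,

$$F^{M_3}_{pq}(a) = \begin{cases} F^{M_1}_{pq}(a) & \text{if } V_{\updownarrow\pi_1}(p) \ge_{opt} V_{\updownarrow\pi_2}(p) \\ F^{M_2}_{pq}(a) & \text{otherwise} \end{cases}$$

It remains to show that $VI_{M_3,\pi_3}(v)(p) \ge v(p)$ for all $p \in Q$. Now fix an
arbitrary $p \in Q$.

Case 1: Suppose $V_{\updownarrow\pi_1}(p) \ge_{opt} V_{\updownarrow\pi_2}(p)$.

Then by the definition of $\oplus_{opt}$, $\pi_3(p) = \pi_1(p)$. Also, by the definition of $\ge_{opt}$,
$V_{\uparrow\pi_1}(p) \ge V_{\uparrow\pi_2}(p)$ and so $v(p) = V_{M_1,\pi_1}(p)$ is true, and by the definition of
$M_3$, $F^{M_3}_{pq}(\pi_3(p)) = F^{M_1}_{pq}(\pi_3(p))$. The following inequations thus hold:

$$v(p) = V_{M_1,\pi_1}(p) \tag{45}$$

$${} = R(p) + \gamma \sum_{q \in Q} F^{M_1}_{pq}(\pi_1(p))\, V_{M_1,\pi_1}(q) \tag{46}$$

$${} = R(p) + \gamma \sum_{q \in Q} F^{M_3}_{pq}(\pi_3(p))\, V_{\uparrow\pi_1}(q) \tag{47}$$

$${} \le R(p) + \gamma \sum_{q \in Q} F^{M_3}_{pq}(\pi_3(p))\, v(q) \tag{48}$$

$${} = VI_{M_3,\pi_3}(v)(p). \tag{49}$$

Case 2: Suppose $V_{\updownarrow\pi_1}(p) <_{opt} V_{\updownarrow\pi_2}(p)$.

Then by the definition of $\oplus_{opt}$, $\pi_3(p) = \pi_2(p)$. Also, by the definition of $\ge_{opt}$,
$V_{\uparrow\pi_1}(p) \le V_{\uparrow\pi_2}(p)$ and so $v(p) = V_{M_2,\pi_2}(p)$, and by the definition of
$M_3$, $F^{M_3}_{pq}(\pi_3(p)) = F^{M_2}_{pq}(\pi_3(p))$. Then Equation 45 through Equation 49 hold with
$M_2$ and $\pi_2$ in place of $M_1$ and $\pi_1$ respectively, yielding again that
$v(p) \le VI_{M_3,\pi_3}(v)(p)$, as desired.

Case 1 and Case 2 together imply that $v(p) \le VI_{M_3,\pi_3}(v)(p)$ for all $p \in Q$,
which with Theorem 6 implies part (a) of the lemma.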

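Proof: Part (b): Supposing that $V_{\uparrow\pi_1} = V_{\uparrow\pi_2}$, we show $V_{\updownarrow\pi_3} \ge_{opt} V_{\updownarrow\pi_1}$ and
$V_{\updownarrow\pi_3} \ge_{opt} V_{\updownarrow\pi_2}$. From part (a) of the lemma, we know that $V_{\uparrow\pi_3} \ge_{dom} V_{\uparrow\pi_1}$ and
$V_{\uparrow\pi_3} \ge_{dom} V_{\uparrow\pi_2}$. It suffices to prove in addition that $V_{\downarrow\pi_3} \ge_{dom} V_{\downarrow\pi_1}$ and
$V_{\downarrow\pi_3} \ge_{dom} V_{\downarrow\pi_2}$. We show both by defining $v(p) = \max(V_{\downarrow\pi_1}(p),\ V_{\downarrow\pi_2}(p))$ for
each state $p \in Q$, observing that $v \ge_{dom} V_{\downarrow\pi_1}$ and $v \ge_{dom} V_{\downarrow\pi_2}$, and then showing
that $V_{\downarrow\pi_3} \ge_{dom} v$.

We can show $V_{\downarrow\pi_3} \ge_{dom} v$ by showing that for arbitrary $M \in M_\updownarrow$, $V_{M,\pi_3} \ge_{dom} v$.
By Theorem 6 it suffices to show that for arbitrary state $p \in Q$,
$VI_{M,\pi_3}(v)(p) \ge v(p)$. We divide now into two cases:

Case 1: Suppose $V_{\downarrow\pi_1}(p) \ge V_{\downarrow\pi_2}(p)$.

With the part (b) assumption ($V_{\uparrow\pi_1} = V_{\uparrow\pi_2}$), this implies $V_{\updownarrow\pi_1}(p) \ge_{opt} V_{\updownarrow\pi_2}(p)$.
Then by the definition of $\oplus_{opt}$, $\pi_3(p) = \pi_1(p)$. Also by definition in this case
$v(p) = V_{\downarrow\pi_1}(p)$. Let $M_1$ be a $\pi_1$-minimizing MDP. The following inequation
chain gives the desired conclusion:

$$v(p) = V_{\downarrow\pi_1}(p) \tag{50}$$

$${} = R(p) + \gamma \sum_{q \in Q} F^{M_1}_{pq}(\pi_1(p))\, V_{\downarrow\pi_1}(q) \tag{51}$$

$${} \le R(p) + \gamma \sum_{q \in Q} F^{M}_{pq}(\pi_1(p))\, V_{\downarrow\pi_1}(q) \tag{52}$$

$${} \le R(p) + \gamma \sum_{q \in Q} F^{M}_{pq}(\pi_3(p))\, v(q) \tag{53}$$

$${} = VI_{M,\pi_3}(v)(p). \tag{54}$$

Line 52 requires some justification. Consider an MDP $M_1'$ defined to agree
with $M_1$ everywhere except that for every $q \in Q$, $F^{M_1'}_{pq}(\pi_1(p)) = F^{M}_{pq}(\pi_1(p))$. If Line 52 did
not hold, we would have $V_{\downarrow\pi_1} >_{dom} VI_{M_1',\pi_1}(V_{\downarrow\pi_1})$, and then Theorem 6 could be
used to show that $V_{M_1',\pi_1} <_{dom} V_{\downarrow\pi_1}$, contradicting the definition of $V_{\downarrow\pi_1}$.

Case 2: Suppose $V_{\downarrow\pi_1}(p) < V_{\downarrow\pi_2}(p)$.

With the part (b) assumption this implies that $V_{\updownarrow\pi_1}(p) <_{opt} V_{\updownarrow\pi_2}(p)$.
Then by the definition of $\oplus_{opt}$, $\pi_3(p) = \pi_2(p)$. Also $v(p) = V_{\downarrow\pi_2}(p)$. Let $M_2$
be a $\pi_2$-minimizing MDP. Equations 50 through 54 now hold with $M_1$ and $\pi_1$
replaced by $M_2$ and $\pi_2$, respectively.

We have now shown in both cases that $v(p) \le VI_{M,\pi_3}(v)(p)$, as desired,
concluding the proof of part (b) of the lemma.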

Proof: Part (c): We prove part (c) of the lemma by constructing a value function
$v$ such that $v \ge_{dom} V_{\downarrow\pi_1}$ and $v \ge_{dom} V_{\downarrow\pi_2}$. We then show that $v \le_{dom} V_{\downarrow\pi_4}$
using Theorem 6. We construct $v$ as follows. Let $v(p) = \max(V_{\downarrow\pi_1}(p),\ V_{\downarrow\pi_2}(p))$
for each $p \in Q$. This implies $v \ge_{dom} V_{\downarrow\pi_1}$ and $v \ge_{dom} V_{\downarrow\pi_2}$. We now show
$v \le_{dom} V_{\downarrow\pi_4}$ by showing that for arbitrary $M \in M_\updownarrow$, $v \le_{dom} V_{M,\pi_4}$. Using
Theorem 6 it suffices to show that $VI_{M,\pi_4}(v) \ge_{dom} v$.

Let $M_1 \in M_\updownarrow$ be a $\pi_1$-minimizing MDP, and $M_2 \in M_\updownarrow$ be a $\pi_2$-minimizing
MDP. Note that this implies that $V_{\downarrow\pi_1} = V_{M_1,\pi_1}$ and $V_{\downarrow\pi_2} = V_{M_2,\pi_2}$.

Now fix an arbitrary $p \in Q$, and show that $VI_{M,\pi_4}(v)(p) \ge v(p)$.

Case 1: Suppose $V_{\updownarrow\pi_1}(p) \ge_{pes} V_{\updownarrow\pi_2}(p)$.

Then by the definition of $\oplus_{pes}$, $\pi_4(p) = \pi_1(p)$. Also, by the definition of $\ge_{pes}$,
$V_{\downarrow\pi_1}(p) \ge V_{\downarrow\pi_2}(p)$, and so $v(p) = V_{M_1,\pi_1}(p)$ is true. Equations 50 through 54
now hold with $\pi_4$ in place of $\pi_3$, giving the desired result.

Case 2: Suppose $V_{\updownarrow\pi_1}(p) <_{pes} V_{\updownarrow\pi_2}(p)$.

Then by the definition of $\oplus_{pes}$, $\pi_4(p) = \pi_2(p)$. Also, by the definition of $\ge_{pes}$,
$V_{\downarrow\pi_1}(p) \le V_{\downarrow\pi_2}(p)$, and so $v(p) = V_{M_2,\pi_2}(p)$ is true. Then Equations 50
through 54 hold with $M_2$, $\pi_2$, and $\pi_4$ in place of $M_1$, $\pi_1$, and $\pi_3$,
respectively, yielding again that $v(p) \le VI_{M,\pi_4}(v)(p)$, as desired.

Case 1 and Case 2 together imply that $v(p) \le VI_{M,\pi_4}(v)(p)$ for all $p \in Q$,
which with Theorem 6 implies part (c) of the lemma.

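Proof: Part (d): Supposing that $V_{\downarrow\pi_1} = V_{\downarrow\pi_2}$, we show $V_{\updownarrow\pi_4} \ge_{pes} V_{\updownarrow\pi_1}$ and
$V_{\updownarrow\pi_4} \ge_{pes} V_{\updownarrow\pi_2}$. From part (c) of the lemma, we know that $V_{\downarrow\pi_4} \ge_{dom} V_{\downarrow\pi_1}$ and
$V_{\downarrow\pi_4} \ge_{dom} V_{\downarrow\pi_2}$. It suffices to prove in addition that $V_{\uparrow\pi_4} \ge_{dom} V_{\uparrow\pi_1}$ and
$V_{\uparrow\pi_4} \ge_{dom} V_{\uparrow\pi_2}$. We show both by defining $v(p) = \max(V_{\uparrow\pi_1}(p),\ V_{\uparrow\pi_2}(p))$ for
each state $p \in Q$, observing that $v \ge_{dom} V_{\uparrow\pi_1}$ and $v \ge_{dom} V_{\uparrow\pi_2}$, and then showing
that $V_{\uparrow\pi_4} \ge_{dom} v$ by giving an MDP $M_4$ for which $V_{M_4,\pi_4} \ge_{dom} v$. Using
Theorem 6 it suffices to show that $VI_{M_4,\pi_4}(v) \ge_{dom} v$.

Let $M_1 \in M_\updownarrow$ be a $\pi_1$-maximizing MDP, and $M_2 \in M_\updownarrow$ be a $\pi_2$-maximizing
MDP. Note that this implies that $V_{\uparrow\pi_1} = V_{M_1,\pi_1}$ and $V_{\uparrow\pi_2} = V_{M_2,\pi_2}$.

We now construct $M_4 \in M_\updownarrow$ as follows: for each $p, q, a$,

$$F^{M_4}_{pq}(a) = \begin{cases} F^{M_1}_{pq}(a) & \text{if } V_{\updownarrow\pi_1}(p) \ge_{pes} V_{\updownarrow\pi_2}(p) \\ F^{M_2}_{pq}(a) & \text{otherwise} \end{cases}$$

It remains to show that $VI_{M_4,\pi_4}(v)(p) \ge v(p)$ for all $p \in Q$. Now fix an
arbitrary $p \in Q$.

Case 1: Suppose $V_{\updownarrow\pi_1}(p) \ge_{pes} V_{\updownarrow\pi_2}(p)$.

With the part (d) assumption this implies that $V_{\uparrow\pi_1}(p) \ge V_{\uparrow\pi_2}(p)$.
Then by the definition of $\oplus_{pes}$, $\pi_4(p) = \pi_1(p)$. Also by definition in this case
$v(p) = V_{\uparrow\pi_1}(p)$. Also, by the definition of $M_4$, $F^{M_4}_{pq}(\pi_4(p)) = F^{M_1}_{pq}(\pi_4(p))$.
Equations 45 through 49 with $\pi_3$ and $M_3$ replaced by $\pi_4$ and $M_4$ complete the
argument.

Case 2: Suppose $V_{\uparrow\pi_1}(p) < V_{\uparrow\pi_2}(p)$.

With the part (d) assumption this implies that $V_{\updownarrow\pi_1}(p) <_{pes} V_{\updownarrow\pi_2}(p)$.
Then by definition $\pi_4(p) = \pi_2(p)$. Also $v(p) = V_{\uparrow\pi_2}(p)$. Equations 45
through 49 now hold with $M_1$, $\pi_1$, and $\pi_3$ replaced by $M_2$, $\pi_2$, and $\pi_4$,
respectively.

We have now shown in both cases that $v(p) \le VI_{M_4,\pi_4}(v)(p)$, as desired,
concluding the proof of part (d) of the lemma.

□ (Lemma 3).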
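
The policy combinations $\oplus_{opt}$ and $\oplus_{pes}$ used in Lemma 3 can be sketched as below. This is an illustration under assumptions rather than our implementation: intervals are represented as hypothetical `(lower, upper)` tuples, the optimistic ordering is taken to compare upper bounds first and lower bounds only on ties (and dually for the pessimistic ordering), and bounds are compared with exact equality, which a numerical implementation would likely relax to a tolerance.

```python
def geq_opt(x, y):
    """x >=_opt y for intervals (lower, upper): upper bounds first, then lower."""
    return x[1] > y[1] or (x[1] == y[1] and x[0] >= y[0])


def geq_pes(x, y):
    """x >=_pes y for intervals (lower, upper): lower bounds first, then upper."""
    return x[0] > y[0] or (x[0] == y[0] and x[1] >= y[1])


def combine_policies(pi1, pi2, V1, V2, geq):
    """At each state keep the action of the policy whose interval value is preferred."""
    return [pi1[p] if geq(V1[p], V2[p]) else pi2[p] for p in range(len(pi1))]


pi1, pi2 = ["a", "a"], ["b", "b"]
V1 = [(0.0, 5.0), (1.0, 3.0)]   # interval values of pi1 at states 0 and 1
V2 = [(2.0, 4.0), (1.0, 3.5)]   # interval values of pi2 at states 0 and 1
pi3 = combine_policies(pi1, pi2, V1, V2, geq_opt)
pi4 = combine_policies(pi1, pi2, V1, V2, geq_pes)
```

Here the optimistic combination follows the first policy at state 0 (its upper bound 5 beats 4) while the pessimistic combination follows the second policy at both states.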

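Theorem 9: For any BMDP $M_\updownarrow$, at every state $p$,

$$V_{\updownarrow opt}(p) = \max_{a \in A,\ \le_{opt}} \left[ \min_{M \in M_\updownarrow} VI_{M,a}(V_{\downarrow opt})(p),\ \max_{M \in M_\updownarrow} VI_{M,a}(V_{\uparrow opt})(p) \right] \tag{55}$$

and

$$V_{\updownarrow pes}(p) = \max_{a \in A,\ \le_{pes}} \left[ \min_{M \in M_\updownarrow} VI_{M,a}(V_{\downarrow pes})(p),\ \max_{M \in M_\updownarrow} VI_{M,a}(V_{\uparrow pes})(p) \right] \tag{56}$$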

Proof: We consider the $V_{\updownarrow opt}$ version only. Throughout this proof we assume
$\pi_{opt}$ is an optimistically optimal policy for $M_\updownarrow$, which exists by Theorem 8. We
suppose Equation 55 is false and show a contradiction. We have two cases:

Case 1: Suppose the upper bounds are not equal at some state $p$:

$$V_{\uparrow opt}(p) \ne \max_{a \in A}\ \max_{M \in M_\updownarrow} VI_{M,a}(V_{\uparrow opt})(p). \tag{57}$$

There are two ways this can happen:

Subcase 1a: Suppose there exist some MDP $M \in M_\updownarrow$ and action $a \in A$ such
that

$$V_{\uparrow opt}(p) < VI_{M,a}(V_{\uparrow opt})(p). \tag{58}$$

We show how to construct a policy $\pi$ whose interval value dominates $V_{\updownarrow opt}$
under $\le_{opt}$, contradicting the definition of $V_{\updownarrow opt}$. Define $\pi$ to be the same as
$\pi_{opt}$ except that $\pi(p) = a$. By the definition of $V_{\uparrow\pi_{opt}}$, there must exist
$M' \in M_\updownarrow$ such that $V_{\uparrow opt} = V_{\uparrow\pi_{opt}} = V_{M',\pi_{opt}}$. From the theory of exact MDPs,
we then have that:

$$V_{\uparrow opt} = V_{M',\pi_{opt}} = VI_{M',\pi_{opt}}(V_{\uparrow opt}). \tag{59}$$

Our subcase assumption implies

$$V_{\uparrow opt}(p) < VI_{M,\pi}(V_{\uparrow opt})(p). \tag{60}$$

Consider the MDP $M_3 \in M_\updownarrow$ with the same parameters as $M'$ except at state $p$
where the parameters are given by $M$. More formally,

$$F^{M_3}_{p'q}(\alpha) = \begin{cases} F^{M}_{p'q}(\alpha) & \text{when } p' = p \\ F^{M'}_{p'q}(\alpha) & \text{otherwise} \end{cases} \tag{61}$$

This construction of $M_3$, together with Equation 59 and Equation 60,
guarantees the following property of $V_{\uparrow opt}$:

$$V_{\uparrow opt} <_{dom} VI_{M_3,\pi}(V_{\uparrow opt}). \tag{62}$$

Equation 62 along with Theorem 6 implies that $V_{M_3,\pi} >_{dom} V_{\uparrow opt}$ and thus that
$V_{\updownarrow\pi} >_{opt} V_{\updownarrow opt}$, contradicting the definition of $V_{\updownarrow opt}$ and concluding Subcase 1a.

Subcase 1b: Suppose that for every choice of $a \in A$ and $M \in M_\updownarrow$,

$$V_{\uparrow opt}(p) > VI_{M,a}(V_{\uparrow opt})(p). \tag{63}$$

We obtain a contradiction directly by exhibiting $a$ and $M \in M_\updownarrow$ in violation of
this supposition. Let $a$ be $\pi_{opt}(p)$. Let $M$ be a $\pi_{opt}$-maximizing MDP in $M_\updownarrow$,
which exists by Theorem 7. Our selection of $\pi_{opt}$ guarantees that $V_{\uparrow\pi_{opt}} = V_{\uparrow opt}$,
and our choice of $M$ guarantees that $V_{M,\pi_{opt}} = V_{\uparrow\pi_{opt}}$. Equations 7 and 8 from
the theory of exact MDPs then ensure that $V_{\uparrow opt}(p) = VI_{M,a}(V_{\uparrow opt})(p)$,
concluding case 1.

Case 2. Suppose at every state $q$ the upper bounds are equal but at some state $p$
the lower bounds are not equal:

$$\text{for all } q,\quad V_{\uparrow opt}(q) = \max_{a \in A}\ \max_{M \in M_\updownarrow} VI_{M,a}(V_{\uparrow opt})(q), \quad\text{and}$$

$$V_{\downarrow opt}(p) \ne \max_{a \in \rho_{V_{\uparrow opt}}(p)}\ \min_{M \in M_\updownarrow} VI_{M,a}(V_{\downarrow opt})(p). \tag{64}$$

Note that the action selection in the second line of Equation 64 is restricted to
range over those actions in $\rho_{V_{\uparrow opt}}(p)$ because those are the only actions that can
be selected in Equation 55 due to the emphasis of $\le_{opt}$ on upper bounds (the
upper bounds achievable by an action primarily determine whether it is selected
by the outer maximization in Equation 55, and only if the action is tied for the
maximum upper bound, i.e. in $\rho_{V_{\uparrow opt}}(p)$, does its lower bound affect the
maximization).

Again, there are two ways the second line of Equation 64 can hold.

Subcase 2a. Suppose $V_{\downarrow opt}(p)$ is too small, i.e., there exists some action
$a \in \rho_{V_{\uparrow opt}}(p)$ such that for every MDP $M \in M_\updownarrow$, we have

$$V_{\downarrow opt}(p) < VI_{M,a}(V_{\downarrow opt})(p). \tag{65}$$

We show a contradiction by giving a policy $\pi$ whose interval value function is
greater than $V_{\updownarrow opt}$ under the $\le_{opt}$ ordering. Define $\pi$ to be the same as $\pi_{opt}$
except that $\pi(p) = a$. By the definition of $V_{\uparrow\pi_{opt}}$, there must exist $M' \in M_\updownarrow$
such that $V_{\uparrow opt} = V_{M',\pi_{opt}}$. As in Subcase 1a, we then have that:

$$V_{\uparrow opt} = V_{M',\pi_{opt}} = VI_{M',\pi_{opt}}(V_{\uparrow opt}). \tag{66}$$

From Equation 64 and $a \in \rho_{V_{\uparrow opt}}(p)$ it follows that for some $M \in M_\updownarrow$,

$$V_{\uparrow opt}(p) = VI_{M,a}(V_{\uparrow opt})(p), \tag{67}$$

and thus for $M_3 \in M_\updownarrow$ defined as in Subcase 1a to be equal to $M'$ everywhere
except at state $p$ where $M_3$ is equal to $M$, we have

$$V_{\uparrow opt} = VI_{M_3,\pi}(V_{\uparrow opt}). \tag{68}$$

Therefore $V_{M_3,\pi} = V_{\uparrow opt}$, and by the definitions of $V_{\updownarrow opt}$ and $V_{\uparrow\pi}$, we then have
that $V_{\uparrow opt} \ge_{dom} V_{\uparrow\pi} \ge_{dom} V_{M_3,\pi} = V_{\uparrow opt}$, and so $V_{\uparrow\pi}$ is equal to $V_{\uparrow opt}$. We must
now show that $V_{\downarrow\pi} >_{dom} V_{\downarrow opt}$ to conclude Subcase 2a. We show this by showing
that for every MDP $M_4 \in M_\updownarrow$, $V_{\downarrow opt} <_{dom} VI_{M_4,\pi}(V_{\downarrow opt})$ and using
Theorem 6 to conclude $V_{M_4,\pi} >_{dom} V_{\downarrow opt}$ and thus $V_{\downarrow\pi} >_{dom} V_{\downarrow opt}$ as desired.

To conclude Subcase 2a, then, we must show $V_{\downarrow opt} <_{dom} VI_{M_4,\pi}(V_{\downarrow opt})$. We
show this by contradiction. Suppose this is false; then either
$V_{\downarrow opt} = VI_{M_4,\pi}(V_{\downarrow opt})$, which our Subcase 2a assumption rules out at state $p$, or
there must be some state $q$ for which $V_{\downarrow opt}(q) > VI_{M_4,\pi}(V_{\downarrow opt})(q)$. Again our
Subcase assumption rules this out for state $p$, so we know that $q$ is not equal to
$p$, and therefore by our choice of $\pi$ we have that $\pi(q) = \pi_{opt}(q)$, and thus that
$V_{\downarrow opt}(q) > VI_{M_4,\pi_{opt}}(V_{\downarrow opt})(q)$. We can now derive a contradiction by combining
$M_4$ at state $q$ with a $\pi_{opt}$-minimizing MDP $M_5$ at all other states to get an
MDP $M_6 \in M_\updownarrow$ for which $V_{\downarrow opt}$ strictly dominates $VI_{M_6,\pi_{opt}}(V_{\downarrow opt})$, showing that
$V_{\downarrow opt} >_{dom} V_{M_6,\pi_{opt}}$ (by Theorem 6), contradicting the fact that $V_{\downarrow\pi_{opt}} = V_{\downarrow opt}$.
(The combination of $M_4$ and $M_5$ to get $M_6$ is analogous to the construction in
Line 61 above.)


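Subcase 2b. Suppose $V_{\downarrow opt}(p)$ is "too big" in Line 64, i.e., for every action
$a \in \rho_{V_{\uparrow opt}}(p)$ there is some MDP $M_a \in M_\updownarrow$ such that $VI_{M_a,a}(V_{\downarrow opt})(p) < V_{\downarrow opt}(p)$.

Consider $a = \pi_{opt}(p)$. The definition of "optimistically optimal" along with the
theory of exact MDPs guarantees us that there is some MDP $M$ such that

$$V_{\uparrow opt} = V_{\uparrow\pi_{opt}} = V_{M,\pi_{opt}} = VI_{M,\pi_{opt}}(V_{M,\pi_{opt}}) = VI_{M,\pi_{opt}}(V_{\uparrow opt}). \tag{69}$$

By our case 2 assumption,

$$V_{\uparrow opt}(p) = \max_{a' \in A}\ \max_{M' \in M_\updownarrow} VI_{M',a'}(V_{\uparrow opt})(p), \tag{70}$$

and this, together with Line 69 and $a = \pi_{opt}(p)$, implies

$$VI_{M,\pi_{opt}}(V_{\uparrow opt})(p) = \max_{a' \in A}\ \max_{M' \in M_\updownarrow} VI_{M',a'}(V_{\uparrow opt})(p), \tag{71}$$

and therefore that

$$\pi_{opt}(p) \in \operatorname*{argmax}_{a' \in A}\ \max_{M' \in M_\updownarrow} VI_{M',a'}(V_{\uparrow opt})(p), \tag{72}$$

which implies that $a = \pi_{opt}(p) \in \rho_{V_{\uparrow opt}}(p)$. We can then use our subcase
assumption that there must be an MDP $M_a$ such that

$$VI_{M_a,\pi_{opt}}(V_{\downarrow opt})(p) < V_{\downarrow opt}(p).$$

Let $M_1$ be a $\pi_{opt}$-minimizing MDP, as per Theorem 7. Then $V_{M_1,\pi_{opt}} = V_{\downarrow\pi_{opt}} =
V_{\downarrow opt}$ by expanding definitions. So $VI_{M_1,\pi_{opt}}(V_{\downarrow opt}) = V_{\downarrow opt}$. We can now create
a new MDP $M_2$ by copying $M_1$ at every state except $p$, where $M_2$ copies
$M_a$, following the construction used to define $M_3$ in Subcase 1a. By
construction we then have

$$VI_{M_2,\pi_{opt}}(V_{\downarrow opt}) <_{dom} V_{\downarrow opt}, \tag{73}$$

which by Theorem 6 implies $V_{M_2,\pi_{opt}} <_{dom} V_{\downarrow opt}$, contradicting our choice of $\pi_{opt}$
and concluding Subcase 2b, Case 2, and the proof of Theorem 9.

□ (Theorem 9).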
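
The right-hand side of Equation 55 can be read as a single backup step, sketched below in Python. The sketch rests on several assumptions that are ours for illustration only: transition bounds are stored as hypothetical nested lists `F_lo[a][p][q]` and `F_hi[a][p][q]`, rewards `R[p]` are exact, the inner minimization and maximization over MDPs are carried out with the order-maximizing construction of Lemma 1, and ties between upper bounds are detected by exact floating-point equality.

```python
def order_max_row(lo, hi, order):
    """Distribution within [lo, hi] with as much mass as possible on early states."""
    row, slack = list(lo), 1.0 - sum(lo)
    for q in order:
        give = min(hi[q] - lo[q], slack)
        row[q] += give
        slack -= give
    return row


def expected(row, v):
    return sum(f * x for f, x in zip(row, v))


def ivi_opt_backup(V_lo, V_hi, F_lo, F_hi, R, gamma):
    """One optimistic backup of the interval value function [V_lo, V_hi]."""
    n = len(R)
    inc = sorted(range(n), key=lambda q: V_lo[q])                 # worst case for lower bound
    dec = sorted(range(n), key=lambda q: V_hi[q], reverse=True)   # best case for upper bound
    new_lo, new_hi = [], []
    for p in range(n):
        best = None
        for a in range(len(F_lo)):
            lo_a = R[p] + gamma * expected(order_max_row(F_lo[a][p], F_hi[a][p], inc), V_lo)
            hi_a = R[p] + gamma * expected(order_max_row(F_lo[a][p], F_hi[a][p], dec), V_hi)
            # Lexicographic comparison on (upper, lower) mirrors the <=_opt ordering.
            if best is None or (hi_a, lo_a) > best:
                best = (hi_a, lo_a)
        new_hi.append(best[0])
        new_lo.append(best[1])
    return new_lo, new_hi


# Action 0 moves to state 0 with certainty; action 1 has interval-valued transitions.
F_lo = [[[1.0, 0.0], [1.0, 0.0]], [[0.0, 0.5], [0.0, 0.5]]]
F_hi = [[[1.0, 0.0], [1.0, 0.0]], [[0.5, 1.0], [0.5, 1.0]]]
new_lo, new_hi = ivi_opt_backup([0.0, 2.0], [0.0, 2.0], F_lo, F_hi, [0.0, 1.0], 0.5)
```

Iterating such a backup from an arbitrary starting interval function is one way to approach the fixed point characterized by the theorem.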

Theorem 10: For any policy $\pi$, $IVI_{\downarrow\pi}$ and $IVI_{\uparrow\pi}$ are contraction mappings.

Proof: We first show that $IVI_{\uparrow\pi}$ is a contraction mapping on $V$, the space of
value functions. Strictly speaking, $IVI_{\uparrow\pi}$ is a mapping from an interval value
function $V_\updownarrow$ to a value function $V$. However, the specific values $V(p)$ only
depend on the upper bounds $V_\uparrow$ of $V_\updownarrow$. Therefore, the mapping $IVI_{\uparrow\pi}$ is
isomorphic to a function that maps value functions to value functions and, with some
abuse of terminology, we can consider $IVI_{\uparrow\pi}$ to be such a mapping. The same is
true for $IVI_{\downarrow\pi}$, which depends only on the lower bounds $V_\downarrow$.

Let $\hat{u}$ and $\hat{v}$ be interval value functions, with upper bounds $u_\uparrow$ and $v_\uparrow$,
fix $p \in Q$, and assume that
$IVI_{\uparrow\pi}(\hat{v})(p) \ge IVI_{\uparrow\pi}(\hat{u})(p)$. Let $M \in M_\updownarrow$ be an MDP that maximizes the
expression $VI_{M,\pi}(v_\uparrow)(p)$ (Lemma 1 implies that there is such an MDP in the
finite set $X_{M_\updownarrow}$, guaranteeing the existence of $M$ in spite of the infinite
cardinality of $M_\updownarrow$).

Then,

$$0 \le IVI_{\uparrow\pi}(\hat{v})(p) - IVI_{\uparrow\pi}(\hat{u})(p) \tag{74}$$

$${} = \max_{M' \in M_\updownarrow} VI_{M',\pi}(v_\uparrow)(p) - \max_{M' \in M_\updownarrow} VI_{M',\pi}(u_\uparrow)(p) \tag{75}$$

$${} \le R(p) + \gamma \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, v_\uparrow(q) - R(p) - \gamma \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, u_\uparrow(q) \tag{76}$$

$${} = \gamma \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, [v_\uparrow(q) - u_\uparrow(q)] \tag{77}$$

$${} \le \gamma \sum_{q \in Q} F^{M}_{pq}(\pi(p))\, \| v_\uparrow - u_\uparrow \| \tag{78}$$

$${} = \gamma\, \| v_\uparrow - u_\uparrow \|. \tag{79}$$

Line 75 expands the definition of $IVI_{\uparrow\pi}$. Line 76 follows by expanding the
definition of $VI$ and from the fact that $M$ maximizes $VI_{M,\pi}(v_\uparrow)(p)$ by definition.
In Line 77, we simplify the expression by cancelling the immediate reward
terms and factoring out the coefficients $F^{M}_{pq}(\pi(p))$. In Line 78, we introduce an
inequality by replacing the term $v_\uparrow(q) - u_\uparrow(q)$ with the maximum difference over
all states, which by definition is the sup norm. The final step, Line 79, follows
from the fact that $F$ is a probability distribution that sums to 1 and $\| v_\uparrow - u_\uparrow \|$
does not depend on $q$.

Repeating this argument, interchanging the roles of $\hat{u}$ and $\hat{v}$ in the case that
$IVI_{\uparrow\pi}(\hat{v})(p) < IVI_{\uparrow\pi}(\hat{u})(p)$, implies

$$\left| IVI_{\uparrow\pi}(\hat{v})(p) - IVI_{\uparrow\pi}(\hat{u})(p) \right| \le \gamma\, \| v_\uparrow - u_\uparrow \| \tag{80}$$

for all $p \in Q$. Taking the maximum over $p$ in the above expression gives the
result.

The proof that $IVI_{\downarrow\pi}$ is a contraction mapping is very similar, replacing $IVI_{\uparrow\pi}$
with $IVI_{\downarrow\pi}$ throughout, replacing maximization with minimization in Line 75,
and selecting MDP $M$ to minimize the expression $VI_{M,\pi}(u_\downarrow)(p)$ when
$IVI_{\downarrow\pi}(\hat{v})(p) \ge IVI_{\downarrow\pi}(\hat{u})(p)$.

□ (Theorem 10).
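
The contraction property in Equation 80 can be checked numerically on a small example. The sketch below is illustrative only and makes several assumptions: the policy is fixed, so `lo[p]` and `hi[p]` (hypothetical names) bound the transition row used at state `p`, rewards are exact, and the maximization over MDPs is carried out by the greedy order-maximizing construction from Lemma 1.

```python
def order_max_row(lo, hi, order):
    """Distribution within [lo, hi] with as much mass as possible on early states."""
    row, slack = list(lo), 1.0 - sum(lo)
    for q in order:
        give = min(hi[q] - lo[q], slack)
        row[q] += give
        slack -= give
    return row


def ivi_upper(v, lo, hi, R, gamma):
    """Upper-bound backup for a fixed policy: maximize over the MDPs in the BMDP."""
    order = sorted(range(len(v)), key=lambda q: v[q], reverse=True)
    out = []
    for p in range(len(v)):
        row = order_max_row(lo[p], hi[p], order)
        out.append(R[p] + gamma * sum(f * x for f, x in zip(row, v)))
    return out


def sup_dist(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


lo = [[0.25, 0.125, 0.125]] * 3
hi = [[0.75, 0.5, 0.5]] * 3
R, gamma = [0.0, 1.0, 2.0], 0.9
v, u = [3.0, 1.0, 2.0], [0.0, 5.0, -1.0]
lhs = sup_dist(ivi_upper(v, lo, hi, R, gamma), ivi_upper(u, lo, hi, R, gamma))
rhs = gamma * sup_dist(v, u)
```

For these inputs the distance between the two backed-up functions is far below the bound, which is typical: the factor gamma is a worst case, attained only when the same successor states dominate both expectations.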


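Theorem 11: For any policy $\pi$, $V_{\downarrow\pi}$ is a fixed-point of $IVI_{\downarrow\pi}$ and $V_{\uparrow\pi}$ of $IVI_{\uparrow\pi}$,
and therefore $V_{\updownarrow\pi}$ is a fixed-point of $IVI_{\updownarrow\pi}$.

Proof: We prove the theorem for $IVI_{\downarrow\pi}$; the proof for $IVI_{\uparrow\pi}$ is similar. We
show

(a) $IVI_{\downarrow\pi}(V_{\downarrow\pi}) \le_{dom} V_{\downarrow\pi}$, and

(b) $IVI_{\downarrow\pi}(V_{\downarrow\pi}) \ge_{dom} V_{\downarrow\pi}$,

from which we conclude that $IVI_{\downarrow\pi}(V_{\downarrow\pi}) = V_{\downarrow\pi}$. Throughout both cases we
take $M^*$ to be a $\pi$-minimizing MDP, so that $V_{\downarrow\pi} = V_{M^*,\pi}$. By Theorem 7, $M^*$
must exist.

We first prove (a). From Theorem 3, we know that $V_{M^*,\pi}$ is a fixed point of
$VI_{M^*,\pi}$. Thus, for any state $q \in Q$,

$$V_{\downarrow\pi}(q) = V_{M^*,\pi}(q) = VI_{M^*,\pi}(V_{\downarrow\pi})(q). \tag{81}$$

Using this fact and expanding the definition of $IVI_{\downarrow\pi}$, we have, at every state $q$,

$$IVI_{\downarrow\pi}(V_{\downarrow\pi})(q) = \min_{M \in M_\updownarrow} VI_{M,\pi}(V_{\downarrow\pi})(q) \le VI_{M^*,\pi}(V_{\downarrow\pi})(q) = V_{\downarrow\pi}(q).$$

This implies that $IVI_{\downarrow\pi}(V_{\downarrow\pi}) \le_{dom} V_{\downarrow\pi}$ as desired.

To prove (b), suppose for sake of contradiction that for some state $p$,
$IVI_{\downarrow\pi}(V_{\downarrow\pi})(p) < V_{\downarrow\pi}(p)$. Let $M_1 \in M_\updownarrow$ be an MDP that minimizes the
expression $VI_{M,\pi}(V_{\downarrow\pi})(p)$ (see footnote 6).

Then, substituting $M_1$ into the definition of $IVI_{\downarrow\pi}$,

$$IVI_{\downarrow\pi}(V_{\downarrow\pi})(p) = VI_{M_1,\pi}(V_{\downarrow\pi})(p) < V_{\downarrow\pi}(p). \tag{83}$$

We can then construct an MDP $M_2$ by copying $M^*$ at every state except $p$,
where $M_2$ copies $M_1$ (see the proof of Theorem 9, Subcase 1a, for the details of a
similar construction). Because $M_2$ is a copy of $M^*$ at every state but $p$,
Equation 81 must hold with $M_2$ replacing $M^*$ at every state but $p$. Because
$M_2$ is a copy of $M_1$ at state $p$, Equation 83 with $M_2$ replacing $M_1$ must hold
at state $p$. These two facts together imply

$$VI_{M_2,\pi}(V_{\downarrow\pi}) <_{dom} V_{\downarrow\pi}. \tag{84}$$

Then by Theorem 6, $V_{M_2,\pi} <_{dom} V_{\downarrow\pi}$, contradicting the definition of $V_{\downarrow\pi}$.

□ (Theorem 11).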
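
Together with the contraction property of Theorem 10, this fixed-point result suggests computing $V_{\downarrow\pi}$ and $V_{\uparrow\pi}$ by simple iteration. The following Python sketch does so on a toy example; it is an illustration under assumptions, not our prototype: `lo[p]` and `hi[p]` are hypothetical names for the bounds on the transition row chosen by the fixed policy at state `p`, rewards are exact, and a fixed iteration count stands in for a proper convergence test.

```python
def order_max_row(lo, hi, order):
    """Distribution within [lo, hi] with as much mass as possible on early states."""
    row, slack = list(lo), 1.0 - sum(lo)
    for q in order:
        give = min(hi[q] - lo[q], slack)
        row[q] += give
        slack -= give
    return row


def backup(v, lo, hi, R, gamma, optimistic):
    """One interval backup: max over MDPs if optimistic, min over MDPs otherwise."""
    order = sorted(range(len(v)), key=lambda q: v[q], reverse=optimistic)
    return [R[p] + gamma * sum(f * x for f, x in
                               zip(order_max_row(lo[p], hi[p], order), v))
            for p in range(len(v))]


def interval_policy_value(lo, hi, R, gamma, iters=500):
    """Iterate the lower and upper operators from zero toward their fixed points."""
    lower = [0.0] * len(R)
    upper = [0.0] * len(R)
    for _ in range(iters):
        lower = backup(lower, lo, hi, R, gamma, optimistic=False)
        upper = backup(upper, lo, hi, R, gamma, optimistic=True)
    return lower, upper


lo = [[0.25, 0.125, 0.125]] * 3
hi = [[0.75, 0.5, 0.5]] * 3
R, gamma = [0.0, 1.0, 2.0], 0.9
lower, upper = interval_policy_value(lo, hi, R, gamma)
```

Applying either operator once more to the returned bounds leaves them unchanged up to numerical error, which is the fixed-point property stated in the theorem.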


Theorem  13: 

(a)  /V7topt  and  IVh^^  are  contraction  mappings. 

(b)  For  any  value  function  V  and  associated  action  set  selection  function 
and  Cy ,  /V7iQpj  y  and  /V/tpes,  v  contraction  mappings. 

Proof:  We  first  prove  (a).  The  proof  that  a  contraction  mapping  is  an 

extension  of  the  proof  of  Theorem  10.  Let  u  and  D  be  interval  value  functions, 
fix  pE  Q,  and  assume  that  /V7topt(v)(p)  ^  IVIxqj,i(u)(p)  .  Select  M  e  Mi  and 
a  G  A  to  maximize  the  expression  VIm,  a(^t)(p)  Ugain,  Lemma  1  implies  that 


6.  Such  an  MDP  exists  by  Lemma  1,  which  implies  that  there  must  be  such  an  MDP  in  the  finite  set 
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there  is  such  an  MDP  in  the  finite  set  ,  guaranteeing  the  existence  of  M  in 
spite  of  the  infinite  cardinality  of  Afj ). 

Then, 

0  <  /WT„pt(v)(p)  -  /V/Topt(fi)(p)  (85) 

=  max  max  aCvrXp)  -  max  max  VI ^  a(MT)(p)  (86) 

o,e  A  M  €  Ml  ’  ae  A  M  e  Ml 

< R{p)  +  yf  2  F^g(a)vT(9)')  -  /?(p)  -  vf  X  ^g(«)“T(9)l  (87) 

^ q  e  Q  ^  ^q^  Q  ^ 

<  yIIvt  -  mtII  •  (^^) 

Line  86  expands  the  definition  of  /VYtopf ,  noting  that  maximizing  using  <op, 
selects  interval  upper  bounds  based  only  on  the  upper  bounds  of  the  input  inter¬ 
vals.  Line  87  follows  from  our  choice  of  M  and  a  to  maximize  V/^  ct(vt)(p)  • 
Line  88  follows  from  Line  87  in  the  same  manner  that  Line  79  followed  from 
Line  76  in  the  proof  of  Theorem  10,  and  the  desired  result  for  /V/j^pj  for  part 
(a)  of  the  theorem  also  follow  in  the  same  manner  as  the  remainder  of 
Theorem  10  followed  from  Line  79. 

To  prove  that  is  a  contraction  mapping,  we  again  fix  a  state  p  and 

assume  IVli^^(v)ip)  >  IVI i^^^(u)(p)  .  We  then  use  vi  to  choose  an  action  a 
that  maximizes  min^g  choose  an  MDP  M  that 

minimizes  a(Mi)(p)  (again.  Lemma  1  implies  that  there  is  such  an  MDP 
in  the  finite  set  ’Xy^^. ,  guaranteeing  the  existence  of  M).  Using  a  and  M  as 


defined  above,  we  have 

0  <  IVh^,{V){p)  -  /y/ipes(M)(p)  (89) 

=  max  min  „(vi)(p)  -  max  min  VI^^^{ui){p)  (90) 

a6i4M6Mj  ’  A  M  €  M- 

<  min  y/^  „(vi)(p)  -  min  „(Mi)(p)  (91) 

M  6  Af;  ’  Me  M: 

^  VI M,  a(vi)(P)  -  VI M,  a(«0(p)  (92) 


Line 90 expands the definition of $IVI_{\downarrow pes}$, using the fact that maximizing using
$\le_{pes}$ selects lower bounds based only on the lower bounds of the intervals
being maximized over. Line 91 substitutes the action $a$, which introduces the
inequality since $a$ was chosen to guarantee

$$\min_{M \in \mathcal{M}} VI_{M,a}(v_\downarrow)(p) = \max_{a \in A} \min_{M \in \mathcal{M}} VI_{M,a}(v_\downarrow)(p), \qquad (93)$$

and the meaning of maximization guarantees that

$$\min_{M \in \mathcal{M}} VI_{M,a}(u_\downarrow)(p) \le \max_{a \in A} \min_{M \in \mathcal{M}} VI_{M,a}(u_\downarrow)(p). \qquad (94)$$

Line 92 follows similarly because $M$ was chosen to guarantee

$$VI_{M,a}(u_\downarrow)(p) = \min_{M \in \mathcal{M}} VI_{M,a}(u_\downarrow)(p), \qquad (95)$$

and the meaning of minimization guarantees that

$$VI_{M,a}(v_\downarrow)(p) \ge \min_{M \in \mathcal{M}} VI_{M,a}(v_\downarrow)(p). \qquad (96)$$

The desired result for $IVI_{\downarrow pes}$ in part (a) of the theorem then follows directly
from Line 92 in the same manner as the result for $IVI_{\uparrow opt}$ followed from
Line 86, concluding the proof of part (a) of the theorem.

For part (b), the proof for $IVI_{\downarrow opt, V}$ follows exactly as the proof for $IVI_{\downarrow pes}$,
except that the set of actions considered in the maximization over actions at
each state $p$ is restricted to $\rho_V(p)$. Likewise, proving $IVI_{\uparrow pes, V}$ is the same as
proving $IVI_{\uparrow opt}$ where the set of actions is restricted to $\sigma_V(p)$.

□  (Theorem  13). 
Abstract. In this paper, we introduce the notion of a bounded parameter
Markov decision process (BMDP) as a generalization of the familiar
exact MDP. A bounded parameter MDP is a set of exact MDPs specified
by giving upper and lower bounds on transition probabilities and
rewards (all the MDPs in the set share the same state and action space).
BMDPs form an efficiently solvable special case of the already known
class of MDPs with imprecise parameters (MDPIPs). Bounded parameter
MDPs can be used to represent variation or uncertainty concerning
the parameters of sequential decision problems in cases where no prior
probabilities on the parameter values are available. Bounded parameter
MDPs can also be used in aggregation schemes to represent the variation
in the transition probabilities for different base states aggregated
together in the same aggregate state.

We introduce interval value functions as a natural extension of traditional
value functions. An interval value function assigns a closed real
interval to each state, representing the assertion that the value of that
state falls within that interval. An interval value function can be used
to bound the performance of a policy over the set of exact MDPs associated
with a given bounded parameter MDP. We describe an iterative
dynamic programming algorithm called interval policy evaluation which
computes an interval value function for a given BMDP and specified policy.
Interval policy evaluation on a policy $\pi$ computes the most restrictive
interval value function that is sound, i.e., that bounds the value function
for $\pi$ in every exact MDP in the set defined by the bounded parameter
MDP. We define optimistic and pessimistic notions of optimal policy, and
provide a variant of value iteration [Bellman, 1957] that we call interval
value iteration which computes policies for a BMDP that are optimal
in these senses.


1  Introduction 

The theory of Markov decision processes (MDPs) provides the semantic foundations
for a wide range of problems involving planning under uncertainty [Boutilier
et al., 1995a, Littman, 1997]. In this paper, we introduce a generalization of
Markov decision processes called bounded parameter Markov decision processes
(BMDPs) that allows us to model uncertainty in the parameters that comprise
an MDP. Instead of encoding a parameter such as the probability of making a
transition from one state to another as a single number, we specify a range of
possible values for the parameter as a closed interval of the real numbers.

A BMDP can be thought of as a family of traditional (exact) MDPs, i.e.,
the  set  of  all  MDPs  whose  parameters  fall  within  the  specified  ranges.  From  this 
perspective,  we  may  have  no  justification  for  committing  to  a  particular  MDP 
in  this  family,  and  wish  to  analyze  the  consequences  of  this  lack  of  commitment. 
Another  interpretation  for  a  BMDP  is  that  the  states  of  the  BMDP  actually 
represent  sets  (aggregates)  of  more  primitive  states  that  we  choose  to  group 
together.  The  intervals  here  represent  the  ranges  of  the  parameters  over  the 
primitive  states  belonging  to  the  aggregates.  While  any  policy  on  the  original 
(primitive)  states  induces  a  stationary  distribution  over  those  states  which  can 
be  used  to  give  prior  probabilities  to  the  different  transition  probabilities  in  the 
intervals,  we  may  be  unable  to  compute  these  prior  probabilities — the  original 
reason  for  aggregating  the  states  is  typically  to  avoid  such  expensive  computation 
over  the  original  large  state  space. 

BMDPs are an efficiently solvable specialization of the already known Markov
Decision  Processes  with  Imprecisely  Known  Transition  Probabilities  (MDPIPs). 
In  the  related  work  section  we  discuss  in  more  detail  how  BMDPs  relate  to 
MDPIPs. 

In  a  related  paper,  we  have  shown  how  BMDPs  can  be  used  as  part  of  a 
strategy  for  efficiently  approximating  the  solution  of  MDPs  with  very  large  state 
spaces and dynamics compactly encoded in a factored (or implicit) representation
[Dean et al., 1997]. In this paper, we focus exclusively on BMDPs, on the
BMDP analog of value functions, called interval value functions, and on policy
selection  for  a  BMDP.  We  provide  BMDP  analogs  of  the  standard  (exact)  MDP 
algorithms  for  computing  the  value  function  for  a  fixed  policy  (plan)  and  (more 
generally)  for  computing  optimal  value  functions  over  all  policies,  called  inter¬ 
val  policy  evaluation  and  interval  value  iteration  (IVI)  respectively.  We  define 
the  desired  output  values  for  these  algorithms  and  prove  that  the  algorithms 
converge to these desired values in polynomial time, for a fixed discount factor.
Finally, we consider two different notions of optimal policy for a BMDP, and
show  how  IVI  can  be  applied  to  extract  the  optimal  policy  for  each  notion.  The 
first  notion  of  optimality  states  that  the  desired  policy  must  perform  better  than 
any  other  under  the  assumption  that  an  adversary  selects  the  model  parameters. 
The  second  notion  requires  the  best  possible  performance  when  a  friendly  choice 
of  model  parameters  is  assumed. 

2  Exact  Markov  Decision  Processes 

An (exact) Markov decision process $M$ is a four tuple $M = (Q, A, F, R)$ where
$Q$ is a set of states, $A$ is a set of actions, $R$ is a reward function that maps each
state to a real value $R(q)$,^1 and $F$ is a state-transition distribution so that for
$a \in A$ and $p, q \in Q$,

$$F_{pq}(a) = \Pr(X_{t+1} = q \mid X_t = p, U_t = a)$$

where $X_t$ and $U_t$ are random variables denoting, respectively, the state and
action at time $t$. When needed we will write $F^M$ to denote the transition function
of the MDP $M$.

^1 The techniques and results in this paper easily generalize to more general reward
functions. We adopt a less general formulation to simplify the presentation.

A policy is a mapping from states to actions, $\pi : Q \to A$. The set of all
policies is denoted $\Pi$. An MDP $M$ together with a fixed policy $\pi \in \Pi$ determines
a Markov chain such that the probability of making a transition from $p$ to $q$ is
defined by $F_{pq}(\pi(p))$. The expected value function (or simply the value function)
associated with such a Markov chain is denoted $V_{M,\pi}$. The value function maps
each state to its expected discounted cumulative reward defined by

$$V_{M,\pi}(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, V_{M,\pi}(q)$$

where $0 \le \gamma < 1$ is called the discount rate.^2 In most contexts, the relevant MDP
is clear and we abbreviate $V_{M,\pi}$ as $V_\pi$.

The optimal value function $V^*_M$ (or simply $V^*$ where the relevant MDP is
clear) is defined as follows,

$$V^*(p) = \max_{a \in A} \Big( R(p) + \gamma \sum_{q \in Q} F_{pq}(a)\, V^*(q) \Big).$$

The value function $V^*$ is greater than or equal to any value function $V_\pi$ in the
partial order $\ge_{dom}$ defined as follows: $V_1 \ge_{dom} V_2$ if and only if for all states $q$,
$V_1(q) \ge V_2(q)$.

An optimal policy is any policy $\pi^*$ for which $V^* = V_{\pi^*}$. Every MDP has at
least one optimal policy, and the set of optimal policies can be found by replacing
the max in the definition of $V^*$ with arg max.

3  Bounded  Parameter  Markov  Decision  Processes 

A bounded parameter MDP is a four tuple $\mathcal{M} = (Q, A, \hat{F}, \hat{R})$ where $Q$ and $A$
are defined as for MDPs, and $\hat{F}$ and $\hat{R}$ are analogous to the MDP $F$ and $R$ but
yield closed real intervals instead of real values. That is, for any action $a$ and
states $p$, $q$, $\hat{R}(p)$ and $\hat{F}_{p,q}(a)$ are both closed real intervals of the form $[l, u]$ for $l$
and $u$ real numbers with $l \le u$, where in the case of $\hat{F}$ we require $0 \le l \le u \le 1$.^3
To ensure that $\hat{F}$ admits well-formed transition functions, we require that for

^2 In this paper, we focus on expected discounted cumulative reward as a performance
criterion, but other criteria, e.g., total or average reward [Puterman, 1994], are also
applicable to bounded parameter MDPs.

^3 To simplify the remainder of the paper, we assume that the reward bounds are always
tight, i.e., that for all $q \in Q$, for some real $l$, $\hat{R}(q) = [l, l]$, and we refer to $l$ as $R(q)$.
The generalization to nontrivial bounds on rewards is straightforward.


Fig.  1.  The  state-transition  diagram  for  a  simple  bounded  parameter  Markov  decision 
process  with  three  states  and  a  single  action.  The  arcs  indicate  possible  transitions  and 
are labeled by their lower and upper bounds.

any  action  a  and  state  p,  the  sum  of  the  lower  bounds  of  Fpq  (a)  over  all  states 
q  must  be  less  than  or  equal  to  1  while  the  upper  bounds  must  sum  to  a  value 
greater  than  or  equal  to  1.  Figure  1  depicts  the  state-transition  diagram  for  a 
simple  BMDP  with  three  states  and  one  action. 
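
The well-formedness condition above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical dictionary layout in which `bounds[(p, a)][q]` holds the lower and upper bound on the probability of moving from `p` to `q` under `a`; the names `Bounds` and `is_well_formed` are illustrative and not from the paper.

```python
from typing import Dict, Tuple

Interval = Tuple[float, float]
# Hypothetical layout: bounds[(p, a)][q] = (lower, upper) bound on F_pq(a).
Bounds = Dict[Tuple[str, str], Dict[str, Interval]]


def is_well_formed(bounds: Bounds, tol: float = 1e-9) -> bool:
    """Check the BMDP conditions on every (state, action) row."""
    for row in bounds.values():
        # Each interval must satisfy 0 <= l <= u <= 1.
        if any(not (-tol <= lo <= hi <= 1.0 + tol) for lo, hi in row.values()):
            return False
        # Lower bounds may not sum past 1; upper bounds must be able to reach 1.
        if sum(lo for lo, _ in row.values()) > 1.0 + tol:
            return False
        if sum(hi for _, hi in row.values()) < 1.0 - tol:
            return False
    return True


good = {("p", "a"): {"p": (0.1, 0.3), "q": (0.2, 0.6), "r": (0.3, 0.7)}}
bad = {("p", "a"): {"p": (0.6, 0.7), "q": (0.5, 0.9)}}  # lower bounds sum to 1.1
print(is_well_formed(good), is_well_formed(bad))
```

When both sums straddle 1, at least one exact transition distribution fits inside the intervals, so the set of exact MDPs is nonempty.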

A BMDP $\mathcal{M} = (Q, A, \hat{F}, \hat{R})$ defines a set of exact MDPs which, by abuse
of notation, we also call $\mathcal{M}$. For exact MDP $M = (Q', A', F', R')$, we have
$M \in \mathcal{M}$ if $Q = Q'$, $A = A'$, and for any action $a$ and states $p$, $q$, $R'(p)$ is in
the interval $\hat{R}(p)$ and $F'_{p,q}(a)$ is in the interval $\hat{F}_{p,q}(a)$. We rely on context to
distinguish between the tuple view of $\mathcal{M}$ and the exact MDP set view of $\mathcal{M}$. In
the definitions in this section, the BMDP $\mathcal{M}$ is implicit.

An interval value function $\hat{V}$ is a mapping from states to closed real intervals.
We generally use such functions to indicate that the given state's value falls
within the selected interval. Interval value functions can be specified for both
exact MDPs and BMDPs. As in the case of (exact) value functions, interval value
functions  are  specified  with  respect  to  a  fixed  policy.  Note  that  in  the  case  of 
BMDPs  a  state  can  have  a  range  of  values  depending  on  how  the  transition 
and  reward  parameters  are  instantiated,  hence  the  need  for  an  interval  value 
function. 

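For each of the interval valued functions $\hat{F}$, $\hat{R}$, $\hat{V}$ we define two real valued
functions which take the same arguments and give the upper and lower interval
bounds, denoted $\overline{F}$, $\overline{R}$, $\overline{V}$, and $\underline{F}$, $\underline{R}$, $\underline{V}$, respectively. So, for example, at any
state $q$ we have $\hat{V}(q) = [\underline{V}(q), \overline{V}(q)]$.

Definition 1. For any policy $\pi$ and state $q$, we define the interval value $\hat{V}_\pi(q)$
of $\pi$ at $q$ to be the interval

$$\Big[ \min_{M \in \mathcal{M}} V_{M,\pi}(q),\ \max_{M \in \mathcal{M}} V_{M,\pi}(q) \Big].$$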

In Section 5 we will give an iterative algorithm which we have proven to converge
to $\hat{V}_\pi$. In preparation for that discussion we now state that there is at least one
specific MDP in $\mathcal{M}$ which simultaneously achieves $\overline{V}_\pi(q)$ for all states $q$ (and
likewise a specific MDP achieving $\underline{V}_\pi(q)$ for all $q$).

Definition 2. For any policy $\pi$, an MDP in $\mathcal{M}$ is $\pi$-maximizing if it is a possible
value of $\arg\max_{M \in \mathcal{M}} V_{M,\pi}$ and it is $\pi$-minimizing if it is in $\arg\min_{M \in \mathcal{M}} V_{M,\pi}$.

Theorem 3. For any policy $\pi$, there exist $\pi$-maximizing and $\pi$-minimizing MDPs
in $\mathcal{M}$.

This theorem implies that $\underline{V}_\pi$ is equivalent to $\min_{M \in \mathcal{M}} V_{M,\pi}$ where the minimization
is done relative to $\ge_{dom}$, and likewise for $\overline{V}_\pi$ using max. We give an algorithm
in Section 5 which converges to $\underline{V}_\pi$ by also converging to a $\pi$-minimizing
MDP in $\mathcal{M}$ (likewise for $\overline{V}_\pi$).

We now consider how to define an optimal value function for a BMDP. Consider
the expression $\max_{\pi \in \Pi} \hat{V}_\pi$. This expression is ill-formed because we have
not defined how to rank the interval value functions $\hat{V}_\pi$ in order to select a maximum.
We focus here on two different ways to order these value functions, yielding
two  notions  of  optimal  value  function  and  optimal  policy.  Other  orderings  may 
also  yield  interesting  results. 

First, we define two different orderings on closed real intervals:

$$[l_1, u_1] \le_{pes} [l_2, u_2] \iff l_1 < l_2 \ \text{or}\ (l_1 = l_2 \ \text{and}\ u_1 \le u_2)$$

$$[l_1, u_1] \le_{opt} [l_2, u_2] \iff u_1 < u_2 \ \text{or}\ (u_1 = u_2 \ \text{and}\ l_1 \le l_2)$$

We extend these orderings to partially order interval value functions by relating
two value functions $\hat{V}_1 \le \hat{V}_2$ only when $\hat{V}_1(q) \le \hat{V}_2(q)$ for every state $q$. We can
now use either of these orderings to compute $\max_{\pi \in \Pi} \hat{V}_\pi$, yielding two definitions
of optimal value function and optimal policy. However, since the orderings are
partial (on value functions), we must still prove that the set of policies contains
a policy which achieves the desired maximum under each ordering (i.e., a policy
whose interval value function is ordered above that of every other policy).
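
As an illustration of how the two orderings differ, here is a small sketch that assumes the lexicographic reading of the definitions: the pessimistic order compares lower bounds first and breaks ties on upper bounds, and the optimistic order does the reverse. The function names `leq_pes`, `leq_opt`, and `max_interval` are hypothetical.

```python
def leq_pes(x, y):
    """[l1, u1] <=_pes [l2, u2]: compare lower bounds, break ties on upper bounds."""
    (l1, u1), (l2, u2) = x, y
    return l1 < l2 or (l1 == l2 and u1 <= u2)


def leq_opt(x, y):
    """[l1, u1] <=_opt [l2, u2]: compare upper bounds, break ties on lower bounds."""
    (l1, u1), (l2, u2) = x, y
    return u1 < u2 or (u1 == u2 and l1 <= l2)


def max_interval(intervals, leq):
    """Return a maximal interval under the given total order on intervals."""
    best = intervals[0]
    for candidate in intervals[1:]:
        if leq(best, candidate):
            best = candidate
    return best


candidates = [(0.0, 10.0), (2.0, 3.0)]
print(max_interval(candidates, leq_pes))  # prefers the better worst case
print(max_interval(candidates, leq_opt))  # prefers the better best case
```

On single intervals both orders are total; they become partial only when lifted state by state to interval value functions, which is why the existence of a maximal policy needs a proof.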

Definition 4. The optimistic optimal value function $\hat{V}_{opt}$ and the pessimistic
optimal value function $\hat{V}_{pes}$ are given by:

$\hat{V}_{opt} = \max_{\pi \in \Pi} \hat{V}_\pi$ using $\le_{opt}$ to order interval value functions
$\hat{V}_{pes} = \max_{\pi \in \Pi} \hat{V}_\pi$ using $\le_{pes}$ to order interval value functions

We say that any policy $\pi$ whose interval value function $\hat{V}_\pi$ is $\ge_{opt}$ ($\ge_{pes}$) the value
functions $\hat{V}_{\pi'}$ of all other policies $\pi'$ is optimistically (pessimistically) optimal.

Theorem 5. There exists at least one optimistically (pessimistically) optimal
policy, and therefore the definition of $\hat{V}_{opt}$ ($\hat{V}_{pes}$) is well-formed.


The above two notions of optimal value can be understood in terms of a
game in which we choose a policy $\pi$ and then a second player chooses in which
MDP $M$ in $\mathcal{M}$ to evaluate the policy. The goal is to get the highest^4 resulting
value function $V_{M,\pi}$. The optimistic optimal value function's upper bounds $\overline{V}_{opt}$
represent the best value function we can obtain in this game if we assume the
second player is cooperating with us. The pessimistic optimal value function's
lower bounds $\underline{V}_{pes}$ represent the best we can do if we assume the second player
is our adversary, trying to minimize the resulting value function.

In the next section, we describe well-known iterative algorithms for computing
the exact MDP optimal value function $V^*$, and then in Section 5 we will
describe similar iterative algorithms which compute the BMDP variants $\hat{V}_{opt}$
($\hat{V}_{pes}$).

4  Estimating  Traditional  Value  Functions 

In  this  section,  we  review  the  basics  concerning  dynamic  programming  methods 
for  computing  value  functions  for  fixed  and  optimal  policies  in  traditional  MDPs. 
In  the  next  section,  we  describe  novel  algorithms  for  computing  the  interval 
analogs  of  these  value  functions  for  bounded  parameter  MDPs. 

We present results from the theory of exact MDPs which rely on the concept
of normed linear spaces. We define operators, $VI_\pi$ and $VI$, on the space of
value functions. We then use the Banach fixed-point theorem (Theorem 6) to
show that iterating these operators converges to unique fixed-points, $V_\pi$ and $V^*$
respectively (Theorems 8 and 9).

Let $\mathcal{V}$ denote the set of value functions on $Q$. For each $v \in \mathcal{V}$, define the (sup)
norm of $v$ by

$$\|v\| = \max_{q \in Q} |v(q)|.$$

We use the term convergence to mean convergence in the norm sense. The space
$\mathcal{V}$ together with $\|\cdot\|$ constitute a complete normed linear space, or Banach space.
If $U$ is a Banach space, then an operator $T : U \to U$ is a contraction mapping if
there exists a $\lambda$, $0 \le \lambda < 1$ such that $\|Tv - Tu\| \le \lambda \|v - u\|$ for all $u$ and $v$ in $U$.
Define $VI : \mathcal{V} \to \mathcal{V}$ and for each $\pi \in \Pi$, $VI_\pi : \mathcal{V} \to \mathcal{V}$ on each $p \in Q$ by

$$VI(v)(p) = \max_{a \in A} \Big( R(p) + \gamma \sum_{q \in Q} F_{pq}(a)\, v(q) \Big)$$

$$VI_\pi(v)(p) = R(p) + \gamma \sum_{q \in Q} F_{pq}(\pi(p))\, v(q).$$

In cases where we need to make explicit the MDP from which the transition
function $F$ originates, we write $VI_{M,\pi}$ and $VI_M$ to denote the operators $VI_\pi$
and $VI$ as just defined, except that the transition function $F$ is $F^M$.

Using these operators, we can rewrite the expression for $V^*$ and $V_\pi$ as

$$V^*(p) = VI(V^*)(p) \quad \text{and} \quad V_\pi(p) = VI_\pi(V_\pi)(p)$$

for all states $p \in Q$. This implies that $V^*$ and $V_\pi$ are fixed points of $VI$ and $VI_\pi$
respectively. The following four theorems show that for each operator, iterating
the operator on an initial value estimate converges to these fixed points.

^4 Value functions are ranked by $\ge_{dom}$.

Theorem 6. For any Banach space $U$ and contraction mapping $T : U \to U$,
there exists a unique $v^*$ in $U$ such that $Tv^* = v^*$; and for arbitrary $v^0$ in $U$, the
sequence $\{v^n\}$ defined by $v^n = Tv^{n-1} = T^n v^0$ converges to $v^*$.

Theorem 7. $VI$ and $VI_\pi$ are contraction mappings.

Theorem 6 and Theorem 7 together prove the following fundamental results
in the theory of MDPs.

Theorem 8. There exists a unique $v^* \in \mathcal{V}$ satisfying $v^* = VI(v^*)$; furthermore,
$v^* = V^*$. Similarly, $V_\pi$ is the unique fixed-point of $VI_\pi$.

Theorem 9. For arbitrary $v^0 \in \mathcal{V}$, the sequence $\{v^n\}$ defined by $v^n = VI(v^{n-1})$
$= VI^n(v^0)$ converges to $V^*$. Similarly, iterating $VI_\pi$ converges to $V_\pi$.

An important consequence of Theorem 9 is that it provides an algorithm for
finding $V^*$ and $V_\pi$. In particular, to find $V^*$, we can start from an arbitrary
initial value function $v^0$ in $\mathcal{V}$, and repeatedly apply the operator $VI$ to obtain
the sequence $\{v^n\}$. This algorithm is referred to as value iteration. Theorem 9
guarantees the convergence of value iteration to the optimal value function.
Similarly, we can specify an algorithm called policy evaluation which finds $V_\pi$ by
repeatedly applying $VI_\pi$ starting with an initial $v^0 \in \mathcal{V}$.
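
The two algorithms just described can be sketched in a few lines. This is an illustrative implementation only, assuming a hypothetical nested-dictionary encoding `F[p][a][q]` for transition probabilities and `R[p]` for rewards; the stopping tolerance and sweep limit are arbitrary choices, not part of the paper.

```python
def backup(p, a, R, F, v, gamma):
    """One-step lookahead: R(p) + gamma * sum_q F_pq(a) v(q)."""
    return R[p] + gamma * sum(prob * v[q] for q, prob in F[p][a].items())


def iterate(operator, states, tol=1e-12, max_sweeps=10_000):
    """Apply a contraction operator repeatedly, starting from the zero function."""
    v = {p: 0.0 for p in states}
    for _ in range(max_sweeps):
        new_v = operator(v)
        done = max(abs(new_v[p] - v[p]) for p in states) < tol
        v = new_v
        if done:
            break
    return v


def value_iteration(states, actions, R, F, gamma):
    """Iterate the operator VI (maximization over actions)."""
    def VI(v):
        return {p: max(backup(p, a, R, F, v, gamma) for a in actions) for p in states}
    return iterate(VI, states)


def policy_evaluation(states, policy, R, F, gamma):
    """Iterate the operator VI_pi for a fixed policy."""
    def VI_pi(v):
        return {p: backup(p, policy[p], R, F, v, gamma) for p in states}
    return iterate(VI_pi, states)


# Toy MDP: "g" is absorbing with reward 1; from "s" we may stay or go to "g".
states, actions = ["s", "g"], ["stay", "go"]
R = {"s": 0.0, "g": 1.0}
F = {"s": {"stay": {"s": 1.0}, "go": {"g": 1.0}},
     "g": {"stay": {"g": 1.0}, "go": {"g": 1.0}}}
v_star = value_iteration(states, actions, R, F, gamma=0.5)
v_stay = policy_evaluation(states, {"s": "stay", "g": "stay"}, R, F, gamma=0.5)
print(v_star, v_stay)
```

With discount 0.5 the absorbing state is worth 1/(1 - 0.5) = 2, so the optimal value of "s" is 0.5 * 2 = 1, while the always-stay policy leaves "s" at 0.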

The following theorem from [Littman et al., 1995] states a convergence rate of
value iteration and policy evaluation which can be derived using bounds on the
precision needed to represent solutions to a linear program of limited precision
(each algorithm can be viewed as solving a linear program).

Theorem 10. For fixed $\gamma$, value iteration and policy evaluation converge to the
optimal value function in a number of steps polynomial in the number of states,
the number of actions, and the number of bits used to represent the MDP parameters.


5  Estimating  Interval  Value  Functions 

In this section, we describe dynamic programming algorithms which operate
on bounded parameter MDPs. We first define the interval equivalent of policy
evaluation $\widehat{IVI}_\pi$ which computes $\hat{V}_\pi$, and then define the variants $\widehat{IVI}_{opt}$ and
$\widehat{IVI}_{pes}$ which compute the optimistic and pessimistic optimal value functions.


5.1 Interval Policy Evaluation

In direct analogy to the definition of $VI_\pi$ in Section 4, we define a function $\widehat{IVI}_\pi$
(for interval value iteration) which maps interval value functions to other interval
value functions. We have proven that iterating $\widehat{IVI}_\pi$ on any initial interval value
function produces a sequence of interval value functions which converges to $\hat{V}_\pi$
in a polynomial number of steps, given a fixed discount factor $\gamma$.

$\widehat{IVI}_\pi(\hat{V})$ is an interval value function, defined for each state $p$ as follows:

$$\widehat{IVI}_\pi(\hat{V})(p) = \Big[ \min_{M \in \mathcal{M}} VI_{M,\pi}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,\pi}(\overline{V})(p) \Big]$$

We define $\underline{IVI}_\pi$ and $\overline{IVI}_\pi$ to be the corresponding mappings from value functions
to value functions (note that for input $\hat{V}$, $\underline{IVI}_\pi$ does not depend on $\overline{V}$ and
so can be viewed as a function from $\mathcal{V}$ to $\mathcal{V}$; likewise for $\overline{IVI}_\pi$ and $\underline{V}$).

The algorithm to compute $\widehat{IVI}_\pi$ is very similar to the standard MDP computation
of $VI$, except that we must now be able to select an MDP $M$ from
the family $\mathcal{M}$ which minimizes (maximizes) the value attained. We select such
an MDP by selecting a function $F$ within the bounds specified by $\hat{F}$ to minimize
(maximize) the value; each possible way of selecting $F$ corresponds to one
MDP in $\mathcal{M}$. We can select the values of $F_{pq}(a)$ independently for each $a$ and
$p$, but the values selected for different states $q$ (for fixed $a$ and $p$) interact: they
must sum up to one. We now show how to determine, for fixed $a$ and $p$, the
value of $F_{pq}(a)$ for each state $q$ so as to minimize (maximize) the expression
$\sum_{q \in Q} F_{pq}(a)\, \underline{V}(q)$ ($\sum_{q \in Q} F_{pq}(a)\, \overline{V}(q)$). This step constitutes the heart of the
IVI algorithm and the only significant way the algorithm differs from standard
value iteration.

The idea is to sort the possible destination states $q$ into increasing (decreasing)
order according to their $\underline{V}$ ($\overline{V}$) value, and then choose the transition probabilities
within the intervals specified by $\hat{F}$ so as to send as much probability
mass as possible to the states early in the ordering. Let $q_1, q_2, \ldots, q_k$ be such an ordering
of $Q$, so that, in the minimizing case, for all $i$ and $j$ if $1 \le i \le j \le k$ then
$\underline{V}(q_i) \le \underline{V}(q_j)$ (increasing order).

Let $r$ be the index $1 \le r \le k$ which maximizes the following expression
without letting it exceed 1:

$$\sum_{i=1}^{r-1} \overline{F}_{p,q_i}(a) + \sum_{i=r}^{k} \underline{F}_{p,q_i}(a)$$

$r$ is the index into the sequence $q_i$ such that below index $r$ we can assign the
upper bound, and above index $r$ we can assign the lower bound, with the rest of
the probability mass from $p$ under $a$ being assigned to $q_r$. Formally, we choose
$F_{pq}(a)$ for all $q \in Q$ as follows:

$$F_{p q_j}(a) = \begin{cases} \overline{F}_{p,q_j}(a) & \text{if } j < r \\ \underline{F}_{p,q_j}(a) & \text{if } j > r \end{cases} \qquad\qquad F_{p q_r}(a) = 1 - \sum_{i=1,\, i \ne r}^{k} F_{p q_i}(a)$$


Fig. 2. An illustration of the basic dynamic programming step in computing an approximate
value function for a fixed policy and bounded parameter MDP. The lighter
shaded portions of each arc represent the required lower bound transition probability
and the darker shaded portions represent the fraction of the remaining transition
probability to the upper bound assigned to the arc by $F$.


Figure 2 illustrates the basic iterative step in the above algorithm, for the
maximizing case. The states $q_i$ are ordered according to the value estimates in
$\overline{V}$. The transitions from a state $p$ to states $q_i$ are defined by the function $F$ such
that each transition is equal to its lower bound plus some fraction of the leftover
probability mass.
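
The selection step can be sketched as a greedy pass over the sorted destination states. This is an equivalent reformulation rather than a transcription of the paper's index-based construction: every state starts at its lower bound, and the leftover mass is handed out in sorted order up to each upper bound. The function name `extreme_distribution` and the dictionary layout are assumptions, and the bounds are assumed well-formed.

```python
def extreme_distribution(bounds, values, minimize=True):
    """Pick transition probabilities inside the intervals that minimize
    (or maximize) the expected value of the successor state.

    bounds[q] = (lower, upper) bound on the probability of reaching q;
    values[q] = current value estimate for q.
    """
    # Increasing order when minimizing, decreasing order when maximizing.
    order = sorted(bounds, key=lambda q: values[q], reverse=not minimize)
    probs = {q: bounds[q][0] for q in order}        # start at the lower bounds
    slack = max(0.0, 1.0 - sum(probs.values()))     # mass still to hand out
    for q in order:
        lo, hi = bounds[q]
        extra = min(hi - lo, slack)                 # fill up to the upper bound
        probs[q] += extra
        slack -= extra
    return probs


bounds = {"q1": (0.1, 0.3), "q2": (0.2, 0.6), "q3": (0.3, 0.7)}
values = {"q1": 0.0, "q2": 1.0, "q3": 2.0}
lo_dist = extreme_distribution(bounds, values, minimize=True)
hi_dist = extreme_distribution(bounds, values, minimize=False)
expect = lambda dist: sum(dist[q] * values[q] for q in dist)
print(expect(lo_dist), expect(hi_dist))
```

In the minimizing case `q1` is raised to its upper bound 0.3, `q2` absorbs the remaining 0.2 and ends at 0.4, and `q3` stays at its lower bound 0.3, which matches the role of the index $r$ (here $r = 2$).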

Techniques similar to those in Section 4 can be used to prove that iterating
$\underline{IVI}_\pi$ ($\overline{IVI}_\pi$) converges to $\underline{V}_\pi$ ($\overline{V}_\pi$). The key theorems, stated below, assert
first that $\underline{IVI}_\pi$ is a contraction mapping, and second that $\underline{V}_\pi$ is a fixed-point of
$\underline{IVI}_\pi$, and are easily proven.^5

Theorem 11. For any policy $\pi$, $\underline{IVI}_\pi$ and $\overline{IVI}_\pi$ are contraction mappings.

Theorem 12. For any policy $\pi$, $\underline{V}_\pi$ is a fixed-point of $\underline{IVI}_\pi$ and $\overline{V}_\pi$ of $\overline{IVI}_\pi$.

These theorems, together with Theorem 6 (the Banach fixed-point theorem) imply
that iterating $\widehat{IVI}_\pi$ on any initial interval value function converges to $\hat{V}_\pi$,
regardless of the starting point.

Theorem 13. For fixed $\gamma$, interval policy evaluation converges to the desired interval
value function in a number of steps polynomial in the number of states, the
number of actions, and the number of bits used to represent the MDP parameters.
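
Putting the pieces together, interval policy evaluation can be sketched as two coupled fixed-point iterations, one for the lower bounds and one for the upper bounds. The code below is a sketch under assumptions: point rewards `R[p]`, a hypothetical layout `F_bounds[p][a][q] = (lower, upper)`, well-formed bounds, and an arbitrary numerical stopping rule in place of the paper's polynomial step bound.

```python
def extreme_expectation(bounds, values, minimize):
    """Smallest (or largest) expected successor value over all transition
    distributions that respect the interval bounds."""
    order = sorted(bounds, key=lambda q: values[q], reverse=not minimize)
    slack = max(0.0, 1.0 - sum(lo for lo, _ in bounds.values()))
    total = 0.0
    for q in order:
        lo, hi = bounds[q]
        extra = min(hi - lo, slack)   # push leftover mass to early states
        slack -= extra
        total += (lo + extra) * values[q]
    return total


def interval_policy_evaluation(states, policy, R, F_bounds, gamma,
                               tol=1e-12, max_sweeps=10_000):
    """Iterate the lower and upper operators from the zero interval function."""
    lower = {p: 0.0 for p in states}
    upper = {p: 0.0 for p in states}
    for _ in range(max_sweeps):
        new_lower = {p: R[p] + gamma * extreme_expectation(
            F_bounds[p][policy[p]], lower, True) for p in states}
        new_upper = {p: R[p] + gamma * extreme_expectation(
            F_bounds[p][policy[p]], upper, False) for p in states}
        change = max(max(abs(new_lower[p] - lower[p]),
                         abs(new_upper[p] - upper[p])) for p in states)
        lower, upper = new_lower, new_upper
        if change < tol:
            break
    return {p: (lower[p], upper[p]) for p in states}


# Toy BMDP: "g" is absorbing with reward 1; from "s" the chance of reaching
# "g" is only known to lie in [0.2, 0.8].
states = ["s", "g"]
policy = {"s": "a", "g": "a"}
R = {"s": 0.0, "g": 1.0}
F_bounds = {"s": {"a": {"s": (0.2, 0.8), "g": (0.2, 0.8)}},
            "g": {"a": {"s": (0.0, 0.0), "g": (1.0, 1.0)}}}
intervals = interval_policy_evaluation(states, policy, R, F_bounds, gamma=0.5)
print(intervals)
```

For this toy problem the fixed points can be checked by hand: the lower bound at "s" solves $v = 0.5(0.8v + 0.2 \cdot 2)$, giving $1/3$, and the upper bound solves $v = 0.5(0.2v + 0.8 \cdot 2)$, giving $8/9$.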


^5 The min over members of $\mathcal{M}$ is dealt with using a technique similar to that used to
handle the max over actions in the same proof for $V^*$.


5.2 Interval Value Iteration

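As in the case of $VI_\pi$ and $VI$, it is straightforward to modify $\widehat{IVI}_\pi$ so that it
computes optimal policy value intervals by adding a maximization step over the
different action choices in each state. However, unlike standard value iteration,
the quantities being compared in the maximization step are closed real intervals,
so the resulting algorithm varies according to how we choose to compare real
intervals. We define two variations of interval value iteration; other variations
are possible.

$$\widehat{IVI}_{opt}(\hat{V})(p) = \max_{a \in A,\ \le_{opt}} \Big[ \min_{M \in \mathcal{M}} VI_{M,a}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,a}(\overline{V})(p) \Big]$$

$$\widehat{IVI}_{pes}(\hat{V})(p) = \max_{a \in A,\ \le_{pes}} \Big[ \min_{M \in \mathcal{M}} VI_{M,a}(\underline{V})(p),\ \max_{M \in \mathcal{M}} VI_{M,a}(\overline{V})(p) \Big]$$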

The added maximization step introduces no new difficulties in implementing
the algorithm. We discuss convergence for $\widehat{IVI}_{opt}$; the convergence results for
$\widehat{IVI}_{pes}$ are similar. We write $\overline{IVI}_{opt}$ for the upper bound returned by $\widehat{IVI}_{opt}$,
and we consider $\overline{IVI}_{opt}$ a function from $\mathcal{V}$ to $\mathcal{V}$ because $\overline{IVI}_{opt}(\hat{V})$ depends
only on $\overline{V}$. $\overline{IVI}_{opt}$ can be easily shown to be a contraction mapping, and it
can be shown that $\overline{V}_{opt}$ is a fixed point of $\overline{IVI}_{opt}$. It then follows that $\overline{IVI}_{opt}$
converges to $\overline{V}_{opt}$ in polynomially many steps. The analogous results for $\underline{IVI}_{opt}$
are somewhat more problematic. Because the action selection is done according
to $\le_{opt}$, which focuses primarily on the interval upper bounds, $\underline{IVI}_{opt}$ is not
properly a mapping from $\mathcal{V}$ to $\mathcal{V}$, as $\underline{IVI}_{opt}(\hat{V})$ depends on both $\underline{V}$ and $\overline{V}$.
However, for any particular value function $V$ and interval value function $\hat{V}$ such
that $\overline{V} = V$, we can write $\underline{IVI}_{opt,V}$ for the mapping from $\mathcal{V}$ to $\mathcal{V}$ which carries $\underline{V}$
to $\underline{IVI}_{opt}(\hat{V})$. We can then show that for each $V$, $\underline{IVI}_{opt,V}$ converges as desired.

The algorithm must then iterate $\overline{IVI}_{opt}$ until convergence to some upper bound $\overline{V}$,
and then iterate $\underline{IVI}_{opt,\overline{V}}$ to converge to the lower bounds $\underline{V}$, with each convergence
within polynomial time.

Theorem 14. A. $\overline{IVI}_{opt}$ and $\underline{IVI}_{pes}$ are contraction mappings.

B. For any value function $V$, $\underline{IVI}_{opt,V}$ and $\overline{IVI}_{pes,V}$ are contraction mappings.

Theorem 15. $\hat{V}_{opt}$ is a fixed-point of $\widehat{IVI}_{opt}$, and $\hat{V}_{pes}$ of $\widehat{IVI}_{pes}$.

Theorem 16. For fixed $\gamma$, iteration of $\widehat{IVI}_{opt}$ converges to $\hat{V}_{opt}$, and iteration
of $\widehat{IVI}_{pes}$ converges to $\hat{V}_{pes}$, in polynomially many iterations in the problem size
(including the number of bits used in specifying the parameters).
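
The first phase of each variant, namely the upper bounds of the optimistic value function and the lower bounds of the pessimistic one, depends on a single real-valued function and can therefore be sketched as an ordinary fixed-point iteration. The code below covers only that first phase; the second phase (the remaining bound of each interval, with actions restricted by the first) is omitted. The data layout, names, and stopping rule are assumptions.

```python
def extreme_expectation(bounds, values, minimize):
    """Smallest (or largest) expected successor value within the bounds."""
    order = sorted(bounds, key=lambda q: values[q], reverse=not minimize)
    slack = max(0.0, 1.0 - sum(lo for lo, _ in bounds.values()))
    total = 0.0
    for q in order:
        lo, hi = bounds[q]
        extra = min(hi - lo, slack)
        slack -= extra
        total += (lo + extra) * values[q]
    return total


def bound_value_iteration(states, actions, R, F_bounds, gamma, adversarial,
                          tol=1e-12, max_sweeps=10_000):
    """adversarial=True: lower bounds of the pessimistic optimal values
    (max over actions of min over MDPs). adversarial=False: upper bounds of
    the optimistic optimal values (max over actions of max over MDPs)."""
    v = {p: 0.0 for p in states}
    for _ in range(max_sweeps):
        new_v = {p: max(R[p] + gamma * extreme_expectation(
            F_bounds[p][a], v, adversarial) for a in actions) for p in states}
        change = max(abs(new_v[p] - v[p]) for p in states)
        v = new_v
        if change < tol:
            break
    return v


# Toy BMDP: "safe" reaches the rewarding state "g" with probability exactly
# 0.5; "risky" reaches it with a completely unknown probability in [0, 1].
states, actions = ["s", "g"], ["safe", "risky"]
R = {"s": 0.0, "g": 1.0}
absorbing = {"s": (0.0, 0.0), "g": (1.0, 1.0)}
F_bounds = {"s": {"safe": {"s": (0.5, 0.5), "g": (0.5, 0.5)},
                  "risky": {"s": (0.0, 1.0), "g": (0.0, 1.0)}},
            "g": {"safe": absorbing, "risky": absorbing}}
pes_lower = bound_value_iteration(states, actions, R, F_bounds, 0.5, True)
opt_upper = bound_value_iteration(states, actions, R, F_bounds, 0.5, False)
print(pes_lower["s"], opt_upper["s"])
```

Against an adversary the "safe" action is preferred at "s" (value 2/3, versus 1/3 for "risky" at the fixed point), while a cooperative choice of parameters makes "risky" worth 1.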

6  Policy  Selection,  Sensitivity  Analysis,  and  Aggregation 

In  this  section,  we  consider  some  basic  issues  concerning  the  use  and  interpre¬ 
tation  of  bounded  parameter  MDPs.  We  begin  by  reemphasizing  some  ideas 
introduced  earlier  regarding  the  selection  of  policies. 




To  begin  with,  it  is  important  that  we  are  clear  on  the  status  of  the  bounds 
in  a  bounded  parameter  MDP.  A  bounded  parameter  MDP  specifies  upper  and 
lower bounds on individual parameters; the assumption is that we have no additional
information regarding individual exact MDPs whose parameters fall within
those bounds. In particular, we have no prior over the exact MDPs in the family
of  MDPs  defined  by  a  bounded  parameter  MDP. 

Policy  selection  Despite  the  lack  of  information  regarding  any  particular  MDP, 
we  may  have  to  choose  a  policy.  In  such  a  situation,  it  is  natural  to  consider 
that  the  actual  MDP,  i.e.,  the  one  in  which  we  will  ultimately  have  to  carry  out 
some  policy,  is  decided  by  some  outside  process.  That  process  might  choose  so 
as  to  help  or  hinder  us,  or  it  might  be  entirely  indifferent.  To  minimize  the  risk 
of  performing  poorly,  it  is  reasonable  to  think  in  adversarial  terms;  we  select 
the  policy  which  will  perform  as  well  as  possible  assuming  that  the  adversary 
chooses  so  that  we  perform  as  poorly  as  possible. 

These  choices  correspond  to  optimistic  and  pessimistic  optimal  policies.  We 
have  discussed  in  the  last  section  how  to  compute  interval  value  functions  for 
such  policies — such  value  functions  can  then  be  used  in  a  straightforward  manner 
to  extract  policies  which  achieve  those  values. 

There  are  other  possible  choices,  corresponding  in  general  to  other  means  of 
totally  ordering  real  closed  intervals.  We  might  for  instance  consider  a  policy 
whose  average  performance  over  all  MDPs  in  the  family  is  as  good  as  or  better 
than  the  average  performance  of  any  other  policy.  This  notion  of  average  is 
potentially  problematic,  however,  as  it  essentially  assumes  a  uniform  prior  over 
exact  MDPs  and,  as  stated  earlier,  the  bounds  do  not  imply  any  particular  prior. 

Sensitivity  analysis  There  are  other  ways  in  which  bounded  parameter  MDPs 
might  be  useful  in  planning  under  uncertainty.  For  example,  we  might  assume 
that  we  begin  with  a  particular  exact  MDP,  say,  the  MDP  with  parameters  whose 
values  reflect  the  best  guess  according  to  a  given  domain  expert.  If  we  were  to 
compute  the  optimal  policy  for  this  exact  MDP,  we  might  wonder  about  the 
degree  to  which  this  policy  is  sensitive  to  the  numbers  supplied  by  the  expert. 

To  explore  this  possible  sensitivity  to  the  parameters,  we  might  assess  the 
policy  by  perturbing  the  parameters  and  evaluating  the  policy  with  respect  to 
the  perturbed  MDP.  Alternatively,  we  could  use  BMDPs  to  perform  this  sort  of 
sensitivity  analysis  on  a  whole  family  of  MDPs  by  converting  the  point  estimates 
for  the  parameters  to  confidence  intervals  and  then  computing  bounds  on  the 
value  function  for  the  fixed  policy  via  interval  policy  evaluation. 
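
As a rough illustration of the conversion step, the sketch below widens each point estimate by a fixed half-width and clips the result to $[0, 1]$. The half-width `eps` is a stand-in assumption for whatever confidence interval the domain expert can supply, and the function name `widen_row` is hypothetical.

```python
def widen_row(point_row, eps):
    """Turn point estimates F_pq(a) into intervals [p - eps, p + eps],
    clipped to [0, 1]. point_row[q] is the expert's estimate for state q."""
    return {q: (max(0.0, prob - eps), min(1.0, prob + eps))
            for q, prob in point_row.items()}


row = widen_row({"q1": 0.9, "q2": 0.1}, eps=0.2)
print(row)
```

Because every interval still contains the original estimate, the lower bounds sum to at most 1 and the upper bounds to at least 1, so the resulting bounds remain well-formed and the original exact MDP stays inside the family.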

Aggregation  Another  use  of  BMDPs  involves  a  different  interpretation  altogether. 
Instead  of  viewing  the  states  of  the  bounded  parameter  MDP  as  individual  prim¬ 
itive  states,  we  view  each  state  of  the  BMDP  as  representing  a  set  or  aggregate 
of  states  of  some  other,  larger  MDP. 

In  this  interpretation,  states  are  aggregated  together  because  they  behave 
approximately  the  same  with  respect  to  possible  state  transitions.  A  little  more 
precisely, suppose that the set of states of the BMDP $\mathcal{M}$ corresponds to the set
of blocks $\{B_1, \ldots, B_n\}$ such that the $\{B_i\}$ constitute a partition of the
state space of another MDP with a much larger state space.

Now we interpret the bounds as follows: for any two blocks $B_i$ and $B_j$, let
$\hat{F}_{B_i B_j}(\alpha)$ represent the interval value for the transition from $B_i$ to $B_j$
on action $\alpha$, defined as follows:

$$\hat{F}_{B_i B_j}(\alpha) = \left[ \min_{p \in B_i} \sum_{q \in B_j} F_{pq}(\alpha),\ \max_{p \in B_i} \sum_{q \in B_j} F_{pq}(\alpha) \right].$$

Intuitively,  this  means  that  all  states  in  a  block  behave  approximately  the  same 
(assuming  the  lower  and  upper  bounds  are  close  to  each  other)  in  terms  of 
transitions  to  other  blocks  even  though  they  may  differ  widely  with  regard  to 
transitions  to  individual  states. 
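
As a minimal sketch of the interval construction described above (not code from the original work), the following computes, for one fixed action, the smallest and largest total probability with which any state of one block moves into another block. The function name, the four-state transition matrix, and the two-block partition are hypothetical choices made only for illustration.

```python
def block_transition_interval(F, block_i, block_j):
    """Return (lower, upper) bounds on the probability of moving from a
    state in block_i into block_j, for one fixed action.

    F[p][q] is the transition probability from state p to state q.
    """
    # For each state p in the source block, total mass sent into block_j.
    totals = [sum(F[p][q] for q in block_j) for p in block_i]
    return min(totals), max(totals)


# Hypothetical 4-state transition matrix for a single action.
F = [
    [0.5, 0.3, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.0, 0.1, 0.6, 0.3],
    [0.1, 0.1, 0.3, 0.5],
]
B1, B2 = [0, 1], [2, 3]  # hypothetical partition into two blocks

lower, upper = block_transition_interval(F, B1, B2)
print(lower, upper)
```

When the two bounds are close, every state in the block behaves nearly identically with respect to block-level transitions, which is the condition under which the aggregation is informative.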

In Dean et al. [1997] we discuss methods for using an implicit representation
of an exact MDP with a large number of states to construct an explicit BMDP
with  a  possibly  much  smaller  number  of  states  based  on  an  aggregation  method. 
We  then  show  that  policies  computed  for  this  BMDP  can  be  extended  to  the 
original  large  implicitly  described  MDP.  Note  that  the  original  implicit  MDP 
is  not  even  a  member  of  the  family  of  MDPs  for  the  reduced  BMDP  (it  has  a 
different  state  space,  for  instance).  Nevertheless,  it  is  a  theorem  that  the  policies 
and  value  bounds  of  the  BMDP  can  be  soundly  applied  in  the  original  MDP 
(using  the  aggregation  mapping  to  connect  the  state  spaces). 


7  Related  Work  and  Conclusions 

Our  definition  for  bounded  parameter  MDPs  is  related  to  a  number  of  other 
ideas appearing in the literature on Markov decision processes; in the following,
we mention just a few such ideas. First, BMDPs specialize the MDPs with
imprecisely known parameters (MDPIPs) described and analyzed in the operations
research literature [White and Eldeib, 1994, White and Eldeib, 1986,
Satia and Lave, 1973]. The more general MDPIPs described in these papers require
more general and expensive algorithms for solution. For example, [White
and  Eldeib,  1994]  allows  an  arbitrary  linear  program  to  define  the  bounds  on  the 
transition  probabilities  (and  allows  no  imprecision  in  the  reward  parameters) — 
as  a  result,  the  solution  technique  presented  appeals  to  linear  programming  at 
each  iteration  of  the  solution  algorithm  rather  than  exploit  the  specific  structure 
available  in  a  BMDP.  [Satia  and  Lave,  1973]  mention  the  restriction  to  BMDPs 
but  give  no  special  algorithms  to  exploit  this  restriction.  Their  general  MDPIP 
algorithm  is  very  different  from  our  algorithm  and  involves  two  nested  phases 
of  policy  iteration — the  outer  phase  selecting  a  traditional  policy  and  the  inner 
phase  selecting  a  “policy”  for  “nature”,  i.e.,  a  choice  of  the  transition  parameters 
to  minimize  or  maximize  value  (depending  on  whether  optimistic  or  pessimistic 
assumptions  prevail).  Our  work,  while  originally  developed  independently  of  the 
MDPIP  literature,  follows  similar  lines  to  [Satia  and  Lave,  1973]  in  defining 
optimistic  and  pessimistic  optimal  policies. 

Bertsekas  and  Castanon  [1989]  use  the  notion  of  aggregated  Markov  chains 
and  consider  grouping  together  states  with  approximately  the  same  residuals. 
Methods for bounding value functions are frequently used in approximate algorithms
for solving MDPs; Lovejoy [1991] describes their use in solving partially
observable MDPs. Puterman [1994] provides an excellent introduction to Markov
decision  processes  and  techniques  involving  bounding  value  functions. 

Boutilier and Dearden [1994] and Boutilier et al. [1995b] describe methods for
solving  implicitly  described  MDPs  and  Dean  and  Givan  [1997]  reinterpret  this 
work  in  terms  of  computing  explicitly  described  MDPs  with  aggregate  states. 

Bounded parameter MDPs allow us to represent uncertainty about or variation
in the parameters of a Markov decision process. Interval value functions
capture  the  resulting  variation  in  policy  values.  In  this  paper,  we  have  defined 
both  bounded  parameter  MDP  and  interval  value  function,  and  given  algorithms 
for  computing  interval  value  functions,  and  selecting  and  evaluating  policies. 
Abstract

Sampling is an important tool for estimating
large, complex sums and integrals over high-dimensional
spaces. For instance, importance
sampling has been used as an alternative to exact
methods for inference in belief networks. Ideally,
we want to have a sampling distribution that provides
optimal-variance estimators. In this paper,
we present methods that improve the sampling
distribution by systematically adapting it as we
obtain information from the samples. We present
a stochastic-gradient-descent method for sequentially
updating the sampling distribution based on
the direct minimization of the variance. We also
present other stochastic-gradient-descent methods
based on the minimization of typical notions
of distance between the current sampling distribution
and approximations of the target, optimal
distribution. We finally validate and compare the
different methods empirically by applying them
to the problem of action evaluation in influence
diagrams.


1  INTRODUCTION 

Often,  we  are  interested  in  computing  quantities  involving 
large  sums,  such  as  expectations  in  uncertain,  structured 
domains. For instance, belief inference in Bayesian networks
(BNs) requires that we sum or marginalize over the
remaining  variables  that  are  not  of  interest.  Similarly,  in 
order  to  solve  the  problem  of  action  selection  in  influence 
diagrams,  we  sum  over  the  variables  that  are  not  observed 
at  the  time  of  the  decision  in  order  to  compute  the  value  of 
different  action  choices. 

We can represent the uncertainty in structured environments
using a BN. A BN allows us to compactly define
a joint probability distribution over the relevant variables
in a domain. It provides a graphical representation of the
distribution by means of a directed acyclic graph (DAG).
It defines locally a conditional probability distribution for
each relevant variable, represented as a node in the graph,
given the state of its parents in the graph. This decomposition
can help in the evaluation of the sums. However, due
to factors regarding the connectivity of the graph, in general
this is not sufficient to allow an efficient computation
of the exact value of the sums of interest.

Sampling provides an alternative tool for approximately
computing these sums. Sampling methods have been proposed
as an alternative to exact methods for such problems.
In particular, importance sampling (see Geweke [1989],
and the references therein) has been applied to the problem
of belief inference in BNs [Fung and Chang, 1989,
Shachter and Peot, 1989] and action selection in IDs (see
Charnes and Shenoy [1999] and the references therein,
and Ortiz and Kaelbling [2000]). In its simpler form, the
importance-sampling distribution used is the “prior” distribution
of the BN resulting from setting the value of the
evidence. It was noted early on that this sampling distribution
is far from optimal in the sense that it provides estimates
with larger variance than necessary [Shachter and
Peot, 1989]. For instance, the optimal sampling distribution
in the case of belief inference is to sample the unobserved
variables from the posterior distribution over them
given the observed evidence. If we knew this distribution,
we would already know the answer to the belief inference problem.

Several modifications have been proposed to improve the
estimation of the simple importance sampling distribution
discussed above, based on information obtained from
the samples [Fung and Chang, 1989, Shachter and Peot,
1989, Shwe and Cooper, 1991]. In this paper, we propose
methods to systematically and sequentially update the
importance-sampling distribution. We view the updating
process as one of learning a separate BN just for sampling.
The learning objective is to minimize some error criterion.
A stochastic-gradient method results from the direct minimization
of the variance of the estimator with respect to
the importance sampling distribution as an error function.
Other stochastic-gradient methods result from minimizing
error functions based on typical measures of the notion of
distance between the current sampling distribution and approximations
of the optimal sampling distribution.

2  DEFINITIONS 

We begin by introducing some notation used throughout
the paper. We denote one-dimensional random variables
by capital letters and denote multi-dimensional random
variables by bold capital letters. For instance, we denote
a multi-dimensional random variable by $\mathbf{X}$ and denote
all its components by $(X_1, \ldots, X_n)$, where each $X_i$ is a
one-dimensional random variable. We use small letters
to denote assignments to random variables. For instance,
$\mathbf{X} = \mathbf{x}$ means that for each component $X_i$ of $\mathbf{X}$,
$X_i = x_i$. We denote the set of possible values that $X_i$ can
take by $\Omega_{X_i}$ and the set of possible values that $\mathbf{X}$ can take
by $\Omega_{\mathbf{X}} = \times_{i=1}^{n} \Omega_{X_i}$. We also denote by capital letters the
nodes in a graph. We denote by $\mathrm{Pa}(Y)$ the parents of node
$Y$ in a directed graph.

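We now introduce notation that will become useful during
the description of the methods presented in this paper.
We denote by $\sum_{\mathbf{Z}}$ the operator that sums over
the possible values of the individual variables forming
$\mathbf{Z}$, that is, $\sum_{\mathbf{Z}} \equiv \sum_{Z_1} \cdots \sum_{Z_n}$. For a function $h$ with variables
$\mathbf{Z}$ and $\mathbf{O}$, the expression $h(\mathbf{Z}, \mathbf{O})|_{\mathbf{O} = \mathbf{o}}$ stands for
a function $f$ over variables $\mathbf{Z}$ that results from setting
the values of $\mathbf{O}$ in $h$ with assignment $\mathbf{o}$ while letting
the values for $\mathbf{Z}$ remain unassigned. In other words,
$f(\mathbf{Z}) = h(\mathbf{Z}, \mathbf{O})|_{\mathbf{O} = \mathbf{o}} = h(\mathbf{Z}, \mathbf{O} = \mathbf{o})$. The notation
$\mathbf{X} = (\mathbf{Z}, \mathbf{O})$ means that the variable $\mathbf{X}$ is formed by
all the variables that form $\mathbf{Z}$ and $\mathbf{O}$. That is,

$$\mathbf{X} = (X_1, \ldots, X_n) = (Z_1, \ldots, Z_{n_1}, O_1, \ldots, O_{n_2}),$$

where $n = n_1 + n_2$. Note that we are assuming that the set
of variables forming $\mathbf{Z}$ and those forming $\mathbf{O}$ are disjoint.
The notation $\mathbf{Z} \sim f$ means that the random variable $\mathbf{Z}$ is
distributed according to probability distribution $f$.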

A Bayesian network (BN) is a graphical probabilistic model
used to represent uncertainty in structured domains. It compactly
represents the joint probability distribution over the
relevant variables of the system of interest. It uses a directed
acyclic graph (DAG) to represent the relationship
between the relevant variables. A node in the graph represents
a variable. The model defines a local conditional
distribution $P(X_i \mid \mathrm{Pa}(X_i))$ for each node or variable $X_i$
given its parents $\mathrm{Pa}(X_i)$ in the graph. The joint distribution
is then

$$P(\mathbf{X}) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)).$$

For  instance,  we  can  define  a  BN  on  the  graph  given  in 
Figure  1(a). 

Figure 1: Example of (a) Bayesian network and (b) influence diagram.

The inference problem in BNs is that of computing the posterior
probability of an assignment to a subset of variables
given evidence about another subset of variables in the system.
Assume that the variables are discrete and their sample
spaces or the possible values each variable can take are
finite. In general, let $\mathbf{X} = (\mathbf{Z}, \mathbf{O})$ where $\mathbf{O}$ is the set of
variables of interest, $\mathbf{o}$ is an assignment to it and $\mathbf{Z}$ are the
remaining variables. For this problem we want to compute
probabilities of the kind

$$P(\mathbf{O} = \mathbf{o}) = \sum_{\mathbf{Z}} P(\mathbf{Z}, \mathbf{O} = \mathbf{o}).$$

Often,  the  local  decomposition  of  the  joint  distribution  still 
leads  to  the  evaluation  of  sums  over  a  large  number  of 
variables.  In  general,  this  problem  is  intractable  [Cooper, 
1990]. 
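
To make the factorization and the marginalization sum concrete, here is a minimal sketch that evaluates the joint distribution of a tiny network and computes a marginal by brute-force enumeration. The three-node structure, the probability tables, and all names are hypothetical; the network in Figure 1 is not reproduced here. The enumeration visits every assignment of the unobserved variables, so its cost grows exponentially with their number.

```python
from itertools import product

# Hypothetical binary network: X1 -> X2 and X1 -> X3.
parents = {"X1": (), "X2": ("X1",), "X3": ("X1",)}
# cpt[node][parent_values][value] = P(node = value | parents = parent_values)
cpt = {
    "X1": {(): [0.7, 0.3]},
    "X2": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
    "X3": {(0,): [0.8, 0.2], (1,): [0.25, 0.75]},
}
order = ["X1", "X2", "X3"]  # a topological order of the DAG


def joint(assignment):
    """P(X) as the product of local conditionals P(X_i | Pa(X_i))."""
    p = 1.0
    for node in order:
        pa = tuple(assignment[q] for q in parents[node])
        p *= cpt[node][pa][assignment[node]]
    return p


def marginal(evidence):
    """P(O = o): sum the joint over all values of the remaining variables Z."""
    hidden = [n for n in order if n not in evidence]
    total = 0.0
    for values in product([0, 1], repeat=len(hidden)):
        total += joint({**evidence, **dict(zip(hidden, values))})
    return total


print(marginal({"X3": 1}))
```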

An influence diagram (ID) is a probabilistic model for
decision-making under uncertainty. We can think of an ID
as a BN with decision and utility nodes added. For instance,
we can use our example BN to build an ID as shown in Figure
1(b). The square is a decision node. The diamond is a
utility node. We now have potentially different joint distributions
over the variables, for each action choice available.
Assume for simplicity that there is a single decision node
in the graph. The joint distribution over the variables, given
the action choice $a$ assigned to the decision variable $A$, is

$$P(\mathbf{X} \mid A = a) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i))\big|_{A = a}.$$

The decision associated with a decision node is a function
of its parent nodes in the graph. We will have access to
the value of these variables at the time of making the decision.
Similarly, the utility associated with a utility node is
a function of its parent nodes in the graph.

Assume that we have a finite number of discrete action
choices. Then, one problem is to select the best strategy or
function $\pi^*$ mapping each possible value of the parents of
the decision node to an action choice. The best strategy is
the strategy with highest expected utility. Let $\mathbf{X} = (\mathbf{Z}, \mathbf{O})$
where the variables in $\mathbf{O}$ are parents of the decision node
and $\mathbf{Z}$ are the remaining variables. The problem of obtaining
an optimal strategy reduces to obtaining, for each
assignment $\mathbf{O} = \mathbf{o}$, the action that maximizes the value
associated with the action and the assignment:

$$V_{\mathbf{o}}(a) = \sum_{\mathbf{Z}} P(\mathbf{Z}, \mathbf{O} = \mathbf{o} \mid A = a)\, U(\mathbf{Z}, \mathbf{O} = \mathbf{o}, A = a).$$

Note once again that computing this value requires the evaluation
of a sum. For the same reasons as in the previous
problem  of  belief  inference  in  BNs,  the  exact  computation 
of  this  value  is  intractable  in  general. 

3  IMPORTANCE  SAMPLING 

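Importance sampling provides an alternative to the exact
methods for evaluating sums. Let the quantity of interest
be $G = \sum_{\mathbf{Z}} g(\mathbf{Z})$ for some real function $g$. We
can turn the sum into an expectation by expressing
$G = \sum_{\mathbf{Z}} f(\mathbf{Z}) \left( g(\mathbf{Z}) / f(\mathbf{Z}) \right)$, where $f$ is a probability distribution
over $\mathbf{Z}$ satisfying, for all $\mathbf{Z}$, $g(\mathbf{Z}) \neq 0 \Rightarrow f(\mathbf{Z}) \neq 0$.
We call $f$ the importance-sampling distribution. We define
the weight function $\omega(\mathbf{Z}) = g(\mathbf{Z}) / f(\mathbf{Z})$, which allows
us to express $G = \sum_{\mathbf{Z}} f(\mathbf{Z})\, \omega(\mathbf{Z})$. Hence, we can
obtain an unbiased estimate of $G$ by obtaining $N$ samples
$\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(N)}$ from $\mathbf{Z} \sim f$ and computing the estimate

$$\hat{G} = \frac{1}{N} \sum_{l=1}^{N} \omega(\mathbf{z}^{(l)}). \tag{1}$$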
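
The estimator in equation 1 can be sketched in a few lines. This is an illustrative example only: the target function, the uniform sampling distribution, the sample size, and all names are assumptions chosen so that the exact answer (285) is easy to check by hand.

```python
import random


def importance_sampling_estimate(g, f_prob, f_sample, n, rng):
    """Average of the weights g(z) / f(z) over n independent draws z ~ f."""
    total = 0.0
    for _ in range(n):
        z = f_sample(rng)
        total += g(z) / f_prob(z)
    return total / n


# Hypothetical target: G = sum of z**2 over z in {0, ..., 9}, which is 285.
domain = list(range(10))


def g(z):
    return float(z * z)


def f_prob(z):
    return 1.0 / len(domain)  # uniform sampling distribution


def f_sample(rng):
    return rng.choice(domain)


rng = random.Random(0)
estimate = importance_sampling_estimate(g, f_prob, f_sample, 20000, rng)
print(estimate)
```

The uniform distribution is a valid choice here because it is nonzero wherever $g$ is nonzero, but it is not a low-variance one; a distribution closer to being proportional to $g$ would concentrate samples on the large terms of the sum.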

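We can apply this technique to the problem of belief inference
in BNs. Typically, we let

$$g(\mathbf{Z}) = P(\mathbf{Z}, \mathbf{O} = \mathbf{o}) = \prod_{i=1}^{n_1} P(Z_i \mid \mathrm{Pa}(Z_i)) \prod_{i=1}^{n_2} P(O_i \mid \mathrm{Pa}(O_i)) \Big|_{\mathbf{O} = \mathbf{o}},$$

$$f(\mathbf{Z}) = \prod_{i=1}^{n_1} P(Z_i \mid \mathrm{Pa}(Z_i)) \Big|_{\mathbf{O} = \mathbf{o}},$$

$$\omega(\mathbf{Z}) = \prod_{i=1}^{n_2} P(O_i \mid \mathrm{Pa}(O_i)) \Big|_{\mathbf{O} = \mathbf{o}}.$$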

Note that we are defining the importance sampling distribution
to be the “prior” distribution of the BN. We obtain
samples from this distribution by sampling the variables
in the (partial) order defined by the DAG and according
to the local conditional distribution of the original BN for
each variable. As we traverse the nodes in the graph and
sample the variable corresponding to each one, if we get to
a node or variable that is in the evidence set $\mathbf{O}$, we do not
sample it. Instead, we assign
to it the value given by the evidence assignment $\mathbf{o}$. Therefore,
the resulting samples will be assignments to those
variables that are not in the evidence set according to the
“prior” distribution of the BN. We call the method resulting
from this importance-sampling distribution the traditional
method. In the context of belief inference, this method is
called likelihood-weighting (LW) since the weight function
is a “likelihood” and thus each sample is weighted by its
“likelihood.”
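
The sampling procedure just described can be sketched as follows. The network, its probability tables, the seed, and the function name are hypothetical and serve only to illustrate the traversal: unobserved nodes are sampled from their local conditionals in topological order, while evidence nodes are clamped and contribute their conditional probability to the weight.

```python
import random

# Hypothetical binary network: X1 -> X2 and X1 -> X3.
parents = {"X1": (), "X2": ("X1",), "X3": ("X1",)}
cpt = {
    "X1": {(): [0.7, 0.3]},
    "X2": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
    "X3": {(0,): [0.8, 0.2], (1,): [0.25, 0.75]},
}
order = ["X1", "X2", "X3"]  # topological order: parents before children


def likelihood_weighting(evidence, n, rng):
    """Estimate P(O = o) as the average weight of n samples."""
    total = 0.0
    for _ in range(n):
        sample, weight = dict(evidence), 1.0
        for node in order:
            pa = tuple(sample[q] for q in parents[node])
            dist = cpt[node][pa]
            if node in evidence:
                # Evidence node: do not sample, multiply in its likelihood.
                weight *= dist[evidence[node]]
            else:
                # Unobserved node: sample from the local conditional.
                sample[node] = 0 if rng.random() < dist[0] else 1
        total += weight
    return total / n


estimate = likelihood_weighting({"X3": 1}, 20000, random.Random(1))
print(estimate)  # exact value for this toy network is 0.365
```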

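We can similarly apply this technique in the context of action
selection in IDs to evaluate $V_{\mathbf{o}}(a)$. In general, we let

$$g(\mathbf{Z}) = P(\mathbf{Z}, \mathbf{O} = \mathbf{o} \mid A = a)\, U(\mathbf{Z}, \mathbf{O} = \mathbf{o}, A = a),$$

$$f(\mathbf{Z}) = \prod_{i=1}^{n_1} P(Z_i \mid \mathrm{Pa}(Z_i)) \Big|_{\mathbf{O} = \mathbf{o}, A = a},$$

$$\omega(\mathbf{Z}) = \prod_{i=1}^{n_2} P(O_i \mid \mathrm{Pa}(O_i))\, U(\mathbf{Z}, \mathbf{O}, A) \Big|_{\mathbf{O} = \mathbf{o}, A = a}.$$

In particular, for our example,

$$g(\mathbf{Z}) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1) \times P(X_6 \mid X_2, A = a) P(X_7 \mid X_3, X_6) \times P(X_4 = x_4 \mid X_2) P(X_5 = x_5 \mid X_2, X_3) \times U(X_7, A = a),$$

$$f(\mathbf{Z}) = P(X_1) P(X_2 \mid X_1) P(X_3 \mid X_1) \times P(X_6 \mid X_2, A = a) P(X_7 \mid X_3, X_6),$$

$$\omega(\mathbf{Z}) = P(X_4 = x_4 \mid X_2) P(X_5 = x_5 \mid X_2, X_3) \times U(X_7, A = a).$$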

An important property of the estimator $\hat{G}$ is the variance of
the weights associated with the importance-sampling distribution.
This is

$$\mathrm{Var}[\omega(\mathbf{Z})] = \sum_{\mathbf{Z}} f(\mathbf{Z})\, \omega(\mathbf{Z})^2 - G^2.$$

Recall that $G = \sum_{\mathbf{Z}} g(\mathbf{Z})$ by definition and assume that
$g$ is a positive function. From this we can derive that the
optimal or minimum-variance importance-sampling distribution
is proportional to $g(\mathbf{Z})$:

$$f^*(\mathbf{Z}) = g(\mathbf{Z}) \Big/ \sum_{\mathbf{Z}} g(\mathbf{Z}). \tag{2}$$

The weights will have zero variance in that case, since the
weight function will always output our value of interest
$G$. We also note that we need to avoid letting $f(\mathbf{Z})$ be
too small with respect to $g(\mathbf{Z})$, since this will increase
the variance. As a matter of fact, $\mathrm{Var}[\omega(\mathbf{Z})] \to \infty$ as
$f(\mathbf{Z}) \to 0$ for at least one value of $\mathbf{Z}$. This implies that
we should use importance-sampling distributions with sufficiently
“fat tails.”
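
The following sketch evaluates the weight variance exactly on a small domain to illustrate the three claims above: a generic distribution gives positive variance, the distribution proportional to $g$ gives zero variance, and a distribution that is very small where $g$ is not gives very large variance. The values of $g$ and the three candidate distributions are hypothetical.

```python
def weight_variance(g, f):
    """Var[w(Z)] = sum_z f(z) * w(z)**2 - G**2, with w(z) = g(z) / f(z).

    g and f are lists indexed by the values of a single discrete variable.
    """
    G = sum(g)
    # f(z) * (g(z) / f(z))**2 simplifies to g(z)**2 / f(z).
    second_moment = sum(gz * gz / fz for gz, fz in zip(g, f) if gz > 0)
    return second_moment - G * G


g = [1.0, 2.0, 3.0, 4.0]                # hypothetical positive function values
uniform = [0.25, 0.25, 0.25, 0.25]
optimal = [gz / sum(g) for gz in g]     # proportional to g, as in equation 2
thin_tail = [0.001, 0.333, 0.333, 0.333]  # nearly ignores the first value

print(weight_variance(g, uniform))
print(weight_variance(g, optimal))
print(weight_variance(g, thin_tail))
```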

4  ADAPTIVE  IMPORTANCE  SAMPLING 

The traditional method presented above uses as the
importance-sampling distribution the “prior” distribution
of the BN which can be far from optimal in the sense that
it can have higher variance than necessary. In the case of
evaluating actions in IDs, it also completely ignores potentially
useful information about the utility values. Therefore,
we try to learn the optimal importance-sampling distribution
by adapting the current sampling distribution as we
obtain samples from it.

We view the adaptive process as one of learning a distribution
over the variables being summed over, to use specifically as
an importance-sampling distribution. In particular, we can
view this process as learning BNs from the samples just for
sampling. From the expression of the optimal importance-sampling
distribution given in equation 2 (and, in particular,
from the factorization of the function $g$ for the different
estimation problems), we can deduce that in order to be
able to represent this distribution graphically using a BN
we need to add arcs that connect every pair of nodes that
are parents of observations and/or utility nodes, if they are
not already connected. However, doing so can increase the
size of the model, particularly in cases where the local conditional
probabilities and the utilities have a smaller, more
compact parametric representation (e.g., noisy-ORs). In this
paper, we do not deal with this issue and instead concentrate
on the problem of learning a BN with the same structure
as the original BN (or ID). Hence, we only need to
update the local conditional probability distributions as we
obtain samples.

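We can parameterize the importance-sampling distribution
using a set of parameters $\Theta$. Let the indicator function
$I(Z_i = k, \mathrm{Pa}(Z_i) = j \mid \mathbf{Z}) = 1$ if the condition $Z_i = k$
and $\mathrm{Pa}(Z_i) = j$ agrees with the value assigned to $\mathbf{Z}$; 0
otherwise. Then, we can express the importance-sampling
distribution as

$$f(\mathbf{Z} \mid \Theta) = \prod_{i=1}^{n_1} \prod_{j \in \Omega_{\mathrm{Pa}(Z_i)}} \prod_{k \in \Omega_{Z_i}} \theta_{ijk}^{\, I(Z_i = k, \mathrm{Pa}(Z_i) = j \mid \mathbf{Z})}, \tag{3}$$

where for each $i$, $j$, $k$, $\theta_{ijk} = P(Z_i = k \mid \mathrm{Pa}(Z_i) = j, \Theta)$.
Hence, for all $i$, $j$, $\sum_{k} \theta_{ijk} = 1$.
Note that this representation uses the assumptions of global
and local parameter independence typically used in BNs.
The weight function is also parameterized and defined as
$\omega(\mathbf{Z} \mid \Theta) = g(\mathbf{Z}) / f(\mathbf{Z} \mid \Theta)$.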


4.1  LEARNING  CRITERIA  AND  UPDATE  RULES 

In the following subsections we present different methods
for updating the sampling distribution. The update rules
are all based on gradient descent. Hence, at each time $t$,
we update the parameters as follows:

$$\Theta^{(t+1)} = \Theta^{(t)} - \alpha(t)\, \nabla^{P} e(\Theta^{(t)}). \tag{4}$$

In the update rule above, $\alpha(t)$ denotes the learning rate or
the step size rule and $\nabla^{P} e(\Theta)$ denotes the gradient of error
function $e$, appropriately projected to satisfy the constraints
on $\Theta$. The methods differ in how they define $\nabla^{P} e(\Theta^{(t)})$.

In the discussion below we denote the $N(t)$ i.i.d. samples
as $\mathbf{z}^{(t,1)}, \ldots, \mathbf{z}^{(t,N(t))}$, drawn according to
$\mathbf{Z} \sim f(\mathbf{Z} \mid \Theta^{(t)})$. If we gather samples to estimate $G$ using many different
sampling distributions, how can we combine them
to get an unbiased estimate? It is sufficient to weight them
using any weighting function that is independent of the sub-estimates
obtained by using just the samples for one sampling
distribution. For instance, the estimator

$$\hat{G} = \sum_{t=1}^{T} W(t)\, \hat{G}(\Theta^{(t)}), \tag{5}$$

where $\sum_{t=1}^{T} W(t) = 1$ and $W(t) > 0$, for all $t$, and

$$\hat{G}(\Theta^{(t)}) = \frac{1}{N(t)} \sum_{l=1}^{N(t)} \omega(\mathbf{z}^{(t,l)} \mid \Theta^{(t)}), \tag{6}$$

is unbiased as long as $W(t)$ and $\hat{G}(\Theta^{(t)})$ are independent
for each $t$. Letting $W(t) = 1/T$ will produce an unbiased
estimate. This is the weight we use in the experiments.
In general, we would like to give more weight to
importance-sampling distributions with smaller variances.
Assuming that the variance decreases with $t$, we would like
$W(t)$ to be an increasing sequence of $t$. Note that using
$W(t) \propto 1/\hat{\sigma}^2(t)$, where $\hat{\sigma}^2(t)$ is the sample variance at time $t$,
though appealing, does not necessarily lead to an unbiased
estimator since $W(t)$ and $\hat{G}(\Theta^{(t)})$ are not independent.
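
A minimal sketch of the combination in equations 5 and 6 follows, assuming the weights of each round have already been computed. The function names and the numeric weights are hypothetical; the default $W(t) = 1/T$ matches the choice stated above for the experiments.

```python
def sub_estimate(round_weights):
    """Equation 6: average of the importance weights from one round."""
    return sum(round_weights) / len(round_weights)


def combined_estimate(rounds, W=None):
    """Equation 5: weighted combination of the per-round sub-estimates.

    rounds is a list of lists; rounds[t] holds the weights w(z | Theta_t)
    of the samples drawn from the t-th sampling distribution.
    """
    T = len(rounds)
    if W is None:
        W = [1.0 / T] * T  # fixed in advance, independent of the samples
    if any(w <= 0 for w in W) or abs(sum(W) - 1.0) > 1e-9:
        raise ValueError("W must be positive and sum to one")
    return sum(w * sub_estimate(r) for w, r in zip(W, rounds))


# Hypothetical weights from two rounds of sampling.
rounds = [[1.0, 3.0], [4.0]]
print(combined_estimate(rounds))                # uses W(t) = 1/T
print(combined_estimate(rounds, [0.25, 0.75]))  # any fixed W also works
```

Any schedule $W(t)$ that is fixed before sampling keeps the estimator unbiased; a schedule computed from the same samples, such as inverse sample variance, does not come with that guarantee.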

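We will consider three general strategies: minimizing variance
directly, minimizing distance to global approximations
of the optimal sampling distribution, and minimizing
distance to the empirical distribution of the optimal sampling
distribution based on local approximations. For the
first two strategies, we will find that we can express the
partial derivatives that form the gradient as, for all $i$, $j$, $k$,

$$\frac{\partial e}{\partial \theta_{ijk}} = -\sum_{\mathbf{Z}} f(\mathbf{Z} \mid \Theta)\, \varphi(\mathbf{Z}, \Theta)\, \frac{I(Z_i = k, \mathrm{Pa}(Z_i) = j \mid \mathbf{Z})}{\theta_{ijk}},$$

where $\varphi(\mathbf{Z}, \Theta)$ is a function that depends on the error function.
Note that this is an expectation. Then, the methods
update the parameters by estimating the value of the partial
derivatives evaluated at the current setting of the parameters
as

$$\left. \frac{\partial e}{\partial \theta_{ijk}} \right|_{\Theta^{(t)}} \approx -\frac{1}{N(t)} \sum_{l=1}^{N(t)} \varphi(\mathbf{z}^{(t,l)}, \Theta^{(t)})\, \frac{I(Z_i = k, \mathrm{Pa}(Z_i) = j \mid \mathbf{Z} = \mathbf{z}^{(t,l)})}{\theta^{(t)}_{ijk}}.$$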


4.1.1  Minimizing  Variance  Directly 

As we noted above, the optimal importance-sampling distribution
for estimating $G$ is that which minimizes the
variance of $\omega$. Using that as our objective, we derive a
stochastic-gradient update rule for the parameters of the
importance-sampling distribution. Let the error function
be

$$e_{\mathrm{Var}}(\Theta) = \mathrm{Var}(\omega(\mathbf{Z} \mid \Theta)) = \sum_{\mathbf{Z}} f(\mathbf{Z} \mid \Theta)\, \omega(\mathbf{Z} \mid \Theta)^2 - G^2.$$

The corresponding function for the gradient is

$$\varphi_{\mathrm{Var}}(\mathbf{Z}, \Theta) = \omega(\mathbf{Z} \mid \Theta)^2. \tag{7}$$

Note that using this definition of $\varphi$ yields an unbiased estimate
of the gradient. This is because the gradient is the
expectation of a particular function and, in this case, we can
always evaluate the function exactly. Hence, we can obtain
an unbiased estimate by sampling from $f(\mathbf{Z} \mid \Theta)$.
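
The variance-based update can be sketched for the simplest case of a single discrete variable with no parents, so that $\theta_k = f(Z = k \mid \Theta)$. The gradient estimate uses the squared weight of each sample divided by the probability of the sampled value, with a negative sign. The projection shown here (subtracting the mean of the gradient, flooring, and renormalizing) is only one possible choice and is an assumption of this sketch; the text requires only that the gradient be appropriately projected. All names and numbers are hypothetical.

```python
def variance_gradient(samples, theta, g):
    """Sample estimate of d e_Var / d theta_k for a single variable.

    samples: list of sampled values (indices into theta and g)
    theta:   current sampling probabilities, theta[k] = f(Z = k)
    g:       values of the positive function being summed
    """
    n = len(samples)
    grad = [0.0] * len(theta)
    for z in samples:
        w = g[z] / theta[z]               # importance weight of this sample
        grad[z] -= (w * w) / theta[z] / n
    return grad


def projected_step(theta, grad, step, floor=1e-3):
    """One gradient-descent step kept on the probability simplex.

    Assumed projection: remove the mean of the gradient so the step keeps
    the sum fixed, floor each entry to keep 'fat tails', then renormalize.
    """
    mean = sum(grad) / len(grad)
    moved = [max(t - step * (gk - mean), floor) for t, gk in zip(theta, grad)]
    total = sum(moved)
    return [m / total for m in moved]


g = [1.0, 3.0]       # optimal distribution would be [0.25, 0.75]
theta = [0.5, 0.5]
grad = variance_gradient([0, 0, 1], theta, g)
new_theta = projected_step(theta, grad, step=0.01)
print(grad, new_theta)
```

In this toy run the step moves probability mass toward the second value, which is the direction of the optimal distribution proportional to $g$.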

4.1.2  Minimizing  Variance  Indirectly  via 
Approximate  Global  Minimization 

Recall the optimal importance-sampling distribution $f^*$ for
estimating $G$ given in equation 2. The update rules of
this subsection are all motivated by the idea of reducing
some notion of distance between the current sampling
distribution and this optimal sampling distribution. Note
that we cannot really compute the values of the optimal distribution
since that requires knowing the normalizing constant
$\sum_{\mathbf{Z}} g(\mathbf{Z}) = G$, which is exactly the value we want
to estimate. We approximate the optimal distribution using
the current estimate $\hat{G}$ of $G$ as follows:

$$\hat{f}^*(\mathbf{Z}) = g(\mathbf{Z}) / \hat{G}. \tag{8}$$

In the following, we will consider four error functions, one
based on the sum-squared-error and three based on versions
of the Kullback-Leibler divergence.

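If we use the $L_2$ norm or sum-squared-error function as a
notion of distance between the distributions, then the error
function is

$$e_{L_2}(\Theta) = \frac{1}{2} \sum_{\mathbf{Z}} \left( f^*(\mathbf{Z}) - f(\mathbf{Z} \mid \Theta) \right)^2.$$

The corresponding function for the gradient is

$$\varphi_{L_2}(\mathbf{Z}, \Theta) = f^*(\mathbf{Z}) - f(\mathbf{Z} \mid \Theta) \approx f(\mathbf{Z} \mid \Theta) \left( \frac{\omega(\mathbf{Z} \mid \Theta)}{\hat{G}} - 1 \right), \tag{9}$$

where the approximation results from using $\hat{f}^*(\mathbf{Z})$ as defined
in equation 8 as an approximation to $f^*(\mathbf{Z})$.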

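An alternative, commonly used notion of distance between
two probability distributions is given by the Kullback-Leibler
(KL) divergence. This measure is not symmetric.
One version of the KL divergence in this context is given
by the error function

$$e_{\mathrm{KL}_1}(\Theta) = \sum_{\mathbf{Z}} f^*(\mathbf{Z}) \log \left( f^*(\mathbf{Z}) / f(\mathbf{Z} \mid \Theta) \right).$$

The corresponding function for the gradient is

$$\varphi_{\mathrm{KL}_1}(\mathbf{Z}, \Theta) = f^*(\mathbf{Z}) / f(\mathbf{Z} \mid \Theta) \approx \frac{\omega(\mathbf{Z} \mid \Theta)}{\hat{G}}. \tag{10}$$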

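Another version of the KL divergence is given by the error
function

$$e_{\mathrm{KL}_2}(\Theta) = \sum_{\mathbf{Z}} f(\mathbf{Z} \mid \Theta) \log \left( f(\mathbf{Z} \mid \Theta) / f^*(\mathbf{Z}) \right).$$

The corresponding function for the gradient is

$$\varphi_{\mathrm{KL}_2}(\mathbf{Z}, \Theta) = \log \left( f^*(\mathbf{Z}) / f(\mathbf{Z} \mid \Theta) \right) - 1 \approx \log \left( \frac{\omega(\mathbf{Z} \mid \Theta)}{\hat{G}} \right) - 1. \tag{11}$$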

A “symmetrized” version of KL sometimes used is given
by the error function

$$e_{\mathrm{KL}_s}(\Theta) = \tfrac{1}{2} e_{\mathrm{KL}_1}(\Theta) + \tfrac{1}{2} e_{\mathrm{KL}_2}(\Theta).$$

We can obtain the partial derivatives for this error function
and their approximation accordingly.
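
For reference, the per-sample functions used by the global methods can be written directly in terms of the sample weight, the probability of the sample under the current distribution, and the current estimate of $G$. This is a sketch with hypothetical function names and inputs; it implements the approximate forms in equations 7 and 9 through 11, with the symmetrized version taken as the average of the two KL forms.

```python
import math


def phi_var(w):
    """Equation 7: squared weight."""
    return w * w


def phi_l2(w, f, G_hat):
    """Equation 9 (approximate form): f * (w / G_hat - 1)."""
    return f * (w / G_hat - 1.0)


def phi_kl1(w, G_hat):
    """Equation 10 (approximate form): w / G_hat."""
    return w / G_hat


def phi_kl2(w, G_hat):
    """Equation 11 (approximate form): log(w / G_hat) - 1."""
    return math.log(w / G_hat) - 1.0


def phi_kls(w, G_hat):
    """Symmetrized KL: average of the two KL gradient functions."""
    return 0.5 * phi_kl1(w, G_hat) + 0.5 * phi_kl2(w, G_hat)


# Hypothetical sample: weight 4, sample probability 0.25, current estimate 2.
w, f, G_hat = 4.0, 0.25, 2.0
print(phi_var(w), phi_l2(w, f, G_hat), phi_kl1(w, G_hat), phi_kl2(w, G_hat))
```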

4.1.3  Heuristic  Local  Minimization  Based  on 
Empirical  Distribution 

The update methods in this subsection are motivated by the
idea of minimizing different notions of distance between
the current sampling distribution and an empirical distribution
of the optimal importance-sampling distribution that
we build from the samples. The hope is that the empirical
distribution is a good approximation of the optimal sampling
distribution. We define the empirical distribution, parameterized
by $\hat{\Theta}$, locally as follows: for all $i$, $j$, $k$,

$$\hat{\theta}^{(t)}_{ijk} = \frac{\sum_{l=1}^{N(t)} I(Z_i = k, \mathrm{Pa}(Z_i) = j \mid \mathbf{Z} = \mathbf{z}^{(t,l)})\, \omega(\mathbf{z}^{(t,l)} \mid \Theta^{(t)})}{\sum_{l=1}^{N(t)} I(\mathrm{Pa}(Z_i) = j \mid \mathbf{Z} = \mathbf{z}^{(t,l)})\, \omega(\mathbf{z}^{(t,l)} \mid \Theta^{(t)})}$$

if $\sum_{l=1}^{N(t)} I(\mathrm{Pa}(Z_i) = j \mid \mathbf{Z} = \mathbf{z}^{(t,l)})\, \omega(\mathbf{z}^{(t,l)} \mid \Theta^{(t)}) \neq 0$;
$\hat{\theta}^{(t)}_{ijk} = \theta^{(t)}_{ijk}$ otherwise. We are essentially defining the empirical
distribution using the samples if there are samples
that can be used to define it; otherwise, we revert to the
current distribution. We try to minimize the distance between
the current sampling distribution and the empirical
distribution locally.
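
The weighted-count definition of the empirical distribution can be sketched for one variable $Z_i$ as follows. Each record holds the parent configuration and value observed in one sample together with that sample's importance weight. The data layout and all names are hypothetical; the fallback to the current parameters for parent configurations with no weighted mass follows the definition above.

```python
from collections import defaultdict


def empirical_cpt(records, current):
    """Weighted empirical conditional distribution for one variable.

    records: list of (j, k, weight), meaning a sample had Pa(Z_i) = j and
             Z_i = k, with importance weight w(z | Theta).
    current: dict mapping j to the list [theta_ij0, theta_ij1, ...].
    """
    num = defaultdict(float)  # weighted count of (j, k)
    den = defaultdict(float)  # weighted count of j
    for j, k, w in records:
        num[(j, k)] += w
        den[j] += w
    hat = {}
    for j, row in current.items():
        if den[j] != 0:
            hat[j] = [num[(j, k)] / den[j] for k in range(len(row))]
        else:
            hat[j] = list(row)  # no usable samples: keep current values
    return hat


current = {0: [0.5, 0.5], 1: [0.3, 0.7]}
records = [(0, 0, 1.0), (0, 1, 3.0), (0, 1, 2.0)]
print(empirical_cpt(records, current))
```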

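Similar to the case of the previous strategies, we will find
that we can express the partial derivatives that form the gradient
of the error functions discussed in this subsection as,
for all $i$, $j$, $k$,

$$\frac{\partial e}{\partial \theta_{ijk}} = -\varphi'(\theta_{ijk}, \hat{\theta}_{ijk}),$$

where $\varphi'(\theta_{ijk}, \hat{\theta}_{ijk})$ is a function that depends on the error
function. Then, the methods update the parameters by estimating
the value of the partial derivatives evaluated at the
current setting of the parameters as

$$\left. \frac{\partial e}{\partial \theta_{ijk}} \right|_{\Theta^{(t)}} \approx -\varphi'(\theta^{(t)}_{ijk}, \hat{\theta}^{(t)}_{ijk}).$$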

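We define the local $L_2$-norm error function as

$$e'_{L_2}(\Theta) = \frac{1}{2} \sum_{i,j,k} \left( \hat{\theta}_{ijk} - \theta_{ijk} \right)^2,$$

the error function for one version of KL as

$$e'_{\mathrm{KL}_1}(\Theta) = \sum_{i,j,k} \hat{\theta}_{ijk} \log \left( \hat{\theta}_{ijk} / \theta_{ijk} \right),$$

and the other as

$$e'_{\mathrm{KL}_2}(\Theta) = \sum_{i,j,k} \theta_{ijk} \log \left( \theta_{ijk} / \hat{\theta}_{ijk} \right).$$

From this we obtain the corresponding functions for the
gradient:

$$\varphi'_{L_2}(\theta_{ijk}, \hat{\theta}_{ijk}) = \hat{\theta}_{ijk} - \theta_{ijk},$$

$$\varphi'_{\mathrm{KL}_1}(\theta_{ijk}, \hat{\theta}_{ijk}) = \hat{\theta}_{ijk} / \theta_{ijk},$$

$$\varphi'_{\mathrm{KL}_2}(\theta_{ijk}, \hat{\theta}_{ijk}) = \log \left( \hat{\theta}_{ijk} / \theta_{ijk} \right) - 1.$$

We can obtain an update rule based on the “symmetrized”
version of KL accordingly.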
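
A minimal sketch of the resulting local updates for one row of a conditional probability table follows. Since the partial derivative is the negative of the local gradient function, a gradient-descent step adds the step size times that function to each parameter. Flooring and renormalizing afterwards is an assumption of this sketch, standing in for whatever projection is actually used, and the function names and numbers are hypothetical. The logarithmic rule requires every empirical entry to be nonzero.

```python
import math


def local_phi(rule, theta, theta_hat):
    """Local gradient functions for the L2, KL1, and KL2 error functions."""
    if rule == "L2":
        return theta_hat - theta
    if rule == "KL1":
        return theta_hat / theta
    if rule == "KL2":
        return math.log(theta_hat / theta) - 1.0
    raise ValueError("unknown rule: " + rule)


def local_update(theta_row, hat_row, step, rule, floor=1e-6):
    """One step theta + step * phi' for a row, floored and renormalized."""
    moved = [
        max(t + step * local_phi(rule, t, h), floor)
        for t, h in zip(theta_row, hat_row)
    ]
    total = sum(moved)
    return [m / total for m in moved]


theta_row = [0.5, 0.5]
hat_row = [0.2, 0.8]
print(local_update(theta_row, hat_row, 0.5, "L2"))
print(local_update(theta_row, hat_row, 0.1, "KL2"))
```

With the L2 rule the step is a convex combination of the current and empirical rows, so a step size of 0.5 lands exactly halfway between them and no renormalization is actually needed.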

4.2  DISCUSSION  OF  UPDATE  RULES 

First, note that of all the update rules, only the one derived
for $e_{\mathrm{Var}}$ clearly uses an unbiased estimate of the gradient. It
is not immediately apparent whether the update rules based
on $e_{L_2}$, $e_{\mathrm{KL}_1}$ and $e_{\mathrm{KL}_2}$ use unbiased estimates.

Note also that the magnitudes of the components of the resulting
gradients are different, as suggested by their respective
$\varphi$ functions. The function $\varphi_{\mathrm{Var}}$ has magnitude proportional
to the squares of the weights. The magnitudes of
$\varphi_{L_2}$ and $\varphi_{\mathrm{KL}_1}$ are linear in the weights. However, the magnitude
of $\varphi_{L_2}$ is potentially smaller since it has the probability
of the sample as a factor. The magnitude of $\varphi_{\mathrm{KL}_2}$ is
logarithmic in the weights.

Because we assume that $g$ is positive, the weights are positive.
Hence, $\varphi_{\mathrm{Var}}$ and $\varphi_{\mathrm{KL}_1}$ are always positive. The
function $\varphi_{L_2}$ is positive if $\omega(\mathbf{Z} \mid \Theta)/\hat{G} > 1$. Similarly,
the function $\varphi_{\mathrm{KL}_2}$ is positive if $\log(\omega(\mathbf{Z} \mid \Theta)/\hat{G}) > 1$.
If $\omega(\mathbf{Z} \mid \Theta) > \hat{G}$ then the sampling distribution underestimates
the value of $g$ while if $\omega(\mathbf{Z} \mid \Theta) < \hat{G}$ then it
overestimates the value. Therefore, the signs of $\varphi_{L_2}$ and
$\varphi_{\mathrm{KL}_2}$ depend on whether we under- or over-estimated the
value of $g$. Similarly, the magnitudes of $\varphi_{\mathrm{Var}}$, $\varphi_{L_2}$, $\varphi_{\mathrm{KL}_1}$,
and $\varphi_{\mathrm{KL}_2}$ are related to the amount of under- or over-estimation.
For $\varphi_{\mathrm{Var}}$, $\varphi_{L_2}$ and $\varphi_{\mathrm{KL}_1}$ the magnitude is larger
when the sampling distribution underestimates than when it
overestimates. For $\varphi_{\mathrm{KL}_2}$, the logarithm brings the amount
of over- and underestimation to the same scale. Note that
for the approximations of $\varphi_{L_2}$, $\varphi_{\mathrm{KL}_1}$, and $\varphi_{\mathrm{KL}_2}$, $\hat{G}$ cannot
be zero, and in addition for $\varphi_{\mathrm{KL}_2}$, $\omega(\mathbf{Z} \mid \Theta)$ cannot
be zero. These conditions hold from the assumption that $g$
is positive. Note that unless we constrain the importance-sampling
distribution, all the functions $\varphi_{\mathrm{Var}}$, $\varphi_{\mathrm{KL}_1}$
and $\varphi_{\mathrm{KL}_2}$ will be unbounded even if $g$ is bounded.


The local $L_2$ error function, $e'_{L_2}$, leads to an update rule
for which the step size has a very intuitive interpretation
as a weighting between the current importance-sampling
distribution and the empirical distribution. In the case
of $e'_{\mathrm{KL}_1}$, the update direction is proportional to the ratio
of the empirical distribution with respect to the current
importance-sampling distribution. On the other hand, for
$e'_{\mathrm{KL}_2}$, the update direction is proportional to the logarithm
of the same ratio. Note that $\varphi'_{\mathrm{KL}_2}$ is not defined if at least
one $\hat{\theta}_{ijk} = 0$. We can fix this by letting, for each $k$,

This  is  essentially  imposing  a  Dirichlet  prior  with  parame¬ 
ters  equal  to  the  current  probability  values  on  the  empirical 
distribution  parameters. 

We  can  interpret  the  update  rules  based  on  local  KL- 
divergence  as  adding  weights  to  the  elements  of  the  domain 
of  the  importance-sampling  distribution  and  renormalizing. 
For  the  version  of  KL-divergence  with  respect  to  the  em¬ 
pirical  distribution,  we  are  always  adding  weights.  We  add 
values  relative  to  the  amount  we  underestimated  or  over¬ 
estimated  the  magnitude  of  the  distribution  for  a  particu¬ 
lar  state.  If  we  underestimated,  we  add  weights  larger  than 
one.  If  we  overestimated,  we  add  weights  smaller  than  one. 
For  the  other  version  of  KL-divergence,  due  to  the  loga¬ 
rithm  function,  we  add  weight  if  we  underestimated  while 
we  subtract  weight  if  we  overestimated.  Therefore,  the  log¬ 
arithm  brings  the  amount  of  underestimation  and  overesti¬ 
mation  to  the  same  scale  and  adds  or  subtracts  weight  ac¬ 
cordingly. 

Note that when approximating the gradients for $e_{\mathrm{Var}}$,
$e_{L_2}$, $e_{KL_1}$ and $e_{KL_2}$, we can use as little as one sample
to obtain an estimate of the gradient (i.e., $N(t) = 1$). This is not
advisable for the method based on the local heuristic since the
empirical distribution of the optimal sampling distribution will be
highly inaccurate. Hence, the update rules based on the empirical
distribution will work better when we take a larger number of samples
between updates. Finally, note that when $t = 1$ and $N(t) = 1$,
$\varphi_{L_2} = 0$, and therefore, the parameters will not change in
the first iteration.

5  RELATED  WORK 

Different  variations  of  importance  sampling  have  been  used 
for  the  problems  discussed  in  this  paper  (See  Lin  and 
Druzdzel  [1999]  and  the  references  therein).  Our  methods 
belong  to  the  class  of  forward  samplers  since  they  sam¬ 
ple  from  a  distribution  based  on  the  original  structure  of 
the  BN.  Of  these,  self  importance  sampling  [Shachter  and 
Peot,  1989,  Shwe  and  Cooper,  1991]  is  the  method  closest 
to  the  methods  proposed  in  this  paper  since  it  also  updates 
the  sampling  distribution  as  it  obtains  information  from  the 
samples. This method has an update rule that is very similar
to the one derived for the local $L_2$ error function. It updates the distribution


after  obtaining  the  empirical  distribution,  but  the  update  is 
a  weighting  between  the  empirical  distribution  and  the  first 
sampling  distribution  used  [Shwe  and  Cooper,  1991].  The 
update  rule  is 

““  ^ijk  “ 

In  our  framework,  we  can  think  of  this  update  rule  as  re¬ 
sulting  from  the  error  function 


Figure  2:  Graphical  representation  of  the  ID  for  the  com¬ 
puter  mouse  problem. 


Annealed  importance  sampling  [Neal,  1998]  is  a  related 
technique  in  that  it  tries  to  obtain  samples  from  the  opti¬ 
mal  sampling  distribution.  As  we  understand  it,  the  user 
sets  up  a  sequence  of  distributions,  the  last  distribution  be¬ 
ing  the  optimal  distribution,  typically  defined  by  Markov 
chains.  We  move  from  one  distribution  to  another  as  we 
“anneal”  and  the  sequence  converges  to  the  optimal  sam¬ 
pling  distribution.  The  hope  is  that  we  can  get  an  inde¬ 
pendent  sample  from  that  distribution,  then  we  restart  the 
process  to  try  to  obtain  another  independent  sample,  and 
so  on.  Finally,  it  uses  those  independent  samples  to  obtain 
an  estimate.  Notice  that  each  “traversal”  of  the  sequence 
of  distributions  (or  Markov  chains)  produces  a  single  sam¬ 
ple.  The  technique  is  very  general  and  we  are  unaware  of 
whether  it  has  been  applied  to  the  problems  considered  in 
this  paper.  We  are  currently  investigating  possible  connec¬ 
tions  between  our  methods  and  this  technique. 

6  EMPIRICAL  RESULTS 

We implemented all of the adaptive importance-sampling methods
described above. We let the learning rate $\alpha(t) = \beta/t$, where
$\beta$ is a value that depends on the updating method. We need
different values of $\beta$ for the different methods because of the
differences in magnitude of their gradients. We impose an additional
constraint on the parameters which we call the $\epsilon$-boundary. We
require that for all $i, j, k$, $\theta_{ijk} \geq \epsilon = \gamma /$  ,
where $\gamma$ is a constant factor. In our experiments, we let
$\gamma = 0.1$. We do this so that our sampling distribution has "fat
tails", avoiding extrema in probability and hence the possibility of
infinite variance. We initialize the parameters such that the starting
importance-sampling distribution is the "prior" probability
distribution of the original BN. However, if one of the local
conditional probability values does not satisfy the
$\epsilon$-boundary constraint, we change the distribution so that it
does.


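In order to satisfy the constraint that for all $i, j$,
$\sum_k \theta_{ijk} = 1$, we project the approximation of the gradients
onto the simplex of the local conditional probability distribution.
We do so by letting, for all $i, j, k$,

$$\frac{\partial^{P} e(\Theta)}{\partial \theta_{ijk}} = \frac{\partial e(\Theta)}{\partial \theta_{ijk}} - \frac{1}{|\Omega_{X_i}|} \sum_{k'} \frac{\partial e(\Theta)}{\partial \theta_{ijk'}}. \qquad (13)$$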


Note that this is not enough to guarantee that after taking a step in
the projected direction, the parameters will remain in the constraint
space. If, when updating a local conditional probability distribution,
its respective parameters do not satisfy the constraint, we find the
minimum step $\alpha'$ that will allow them to remain inside the
constraint space and take a step of size $\alpha'/2$ along the gradient
direction (i.e., half the distance between the current position of the
parameter we are updating in the simplex and the closest point on the
$\epsilon$-boundary along the gradient direction).
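
The projection and the halved step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the boundary value `eps` is passed in directly, the step is taken against the gradient (we assume the error function is being minimized), and the starting parameters are assumed to lie inside the constraint space already.

```python
def project_gradient(grad):
    """Subtract the mean so the components sum to zero (simplex tangent space)."""
    mean = sum(grad) / len(grad)
    return [g - mean for g in grad]


def constrained_step(theta, grad, alpha, eps):
    """Take one projected descent step, keeping every entry of theta >= eps.

    theta: one local conditional distribution (sums to 1, entries >= eps).
    grad:  unprojected gradient estimate for the same entries.
    """
    direction = [-g for g in project_gradient(grad)]
    candidate = [t + alpha * d for t, d in zip(theta, direction)]
    if all(c >= eps for c in candidate):
        return candidate
    # Step size at which the first entry would hit the eps-boundary.
    alpha_boundary = min((t - eps) / -d
                         for t, d in zip(theta, direction) if d < 0)
    # Move only half of the way towards the boundary.
    return [t + 0.5 * alpha_boundary * d for t, d in zip(theta, direction)]
```

Because the projected direction sums to zero, both branches preserve the normalization of the local distribution.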

We tested the methods on the computer mouse problem [Ortiz and
Kaelbling, 2000], a simple made-up ID shown in Figure 2. We added one
to all the utility values presented in Ortiz and Kaelbling [2000] to
make $g$ positive. We will consider the problem of obtaining the value
$V_{MP_t}(A)$ for the action $A = 2$ and the observation $MP_t = 1$.

We evaluated each method by computing the mean-squared-error (MSE)
between the true value of the expectation of interest ($V_{MP_t}(A)$)
and the estimate generated using the adaptive sampling method. The
first results show
how  the  methods  achieve  better  MSEs  with  fewer  samples 
for  this  problem.  We  only  show  results  for  those  methods 
that  were  the  most  competitive.  We  denote  by  “Var”  the 
method  based  on  the  minimization  of  the  variance,  and  by 
“L2  ”,  “KLl”,  and  “KLS”  the  methods  based  on  the  global 
minimization  of  L2,  KLi  and  KLa  respectively.  For  the 
update methods we use $N(t) = 1$ for all $t$. We take into


Figure  3:  Average  mean  squared  error,  over  40  runs,  as  a 
function  of  the  number  of  samples  taken.  We  allow  LW 
twice  as  many  samples. 


account  that  the  update  methods  have  to  traverse  the  graph 
once  every  iteration  to  update  the  parameters  relevant  to  the 
sample  taken.  To  compensate  for  this  time,  we  allow  the  es¬ 
timate  based  on  LW  to  use  twice  as  many  samples.  Figure  3 
shows  the  results.  The  graph  shows  the  average  MSE  over 
40  runs  as  a  function  of  the  total  number  of  samples  taken 
(times  2  for  LW)  by  the  methods.  We  note  that  Var  and  L2 
achieve  better  MSEs  than  LW  and  converge  to  them  faster. 
With  significance  level  0.005  we  can  state  (individually) 
for  each  total  number  of  samples  N  =  50, 150, 250,  that 
Var  and  L2  (individually)  are  better  with  respect  to  MSE 
than  LW.  Also,  for  N  =  250,  KLS  is  better  than  LW. 

We also ran the methods with $N(t) = 50$, including the local heuristic
methods. They were only competitive after a larger total number of
samples ($N > 150$). Although further analysis is necessary, we would
like to convey some general observations. We believe that in general
there is a tradeoff in the setting of $N(t)$ and $\beta$. We note that,
of the updates based on the two KL versions, KL1 typically performs
better than KL2. We believe this is because the error function
$e_{KL_1}$ is defined with respect to the optimal sampling distribution
while $e_{KL_2}$ is with respect to the current sampling distribution.
KLS seems to perform better than both. L2 is more stable than any of
the other methods, suggesting further theoretical analysis which we are
currently undertaking. Several possible reasons for this behavior are
(1) the variance of the gradient might be smaller than in other cases,
(2) the error function is bounded, and/or (3) the error surface might
be smoother than in other cases. We conjecture that L2 converges to a
stationary point of $e_{L_2}$.

The  second  result  shows  that  the  update  methods  indeed 
lead  to  importance-sampling  distributions  with  smaller 
variance  relatively  quickly  for  this  problem.  Figure  4 


Figure  4:  Average  of  the  true  variance  of  the  weight  func¬ 
tion,  over  40  runs,  as  a  function  of  the  total  number  of  sam¬ 
ples  taken. 

shows  a  graph  of  the  true  variance  of  the  sampling  distribu¬ 
tion  learned  using  the  different  update  methods  as  a  func¬ 
tion  of  the  total  number  of  samples  used.  The  horizontal 
line  shows  the  variance  associated  with  the  sampling  dis¬ 
tribution  used  by  LW  (i.e.,  the  “prior”  distribution  of  the 
original  BN). 

These  experiments  are  all  carried  out  on  a  single  prob¬ 
lem.  Although  they  must  clearly  be  extended  to  a  variety 
of  larger  problems,  they  indicate  that  adaptive  importance¬ 
sampling  methods,  particularly  those  that  minimize  vari¬ 
ance  and  the  L2  norm,  can  lead  to  significant  improvements 
in  the  efficiency  of  sampling  as  a  method  for  computing 
large  expectations. 
Abstract

Sampling  has  become  an  important  strategy  for  inference  in 
belief  networks.  It  can  also  be  applied  to  the  problem  of 
selecting  actions  in  influence  diagrams.  In  this  paper,  we 
present  methods  with  probabilistic  guarantees  of  selecting  a 
near-optimal  action.  We  establish  bounds  on  the  number  of 
samples  required  for  the  traditional  method  of  estimating  the 
utilities  of  the  actions,  then  go  on  to  extend  the  traditional 
method  based  on  ideas  from  sequential  analysis,  generating  a 
method  requiring  fewer  samples.  Finally,  we  exploit  the  intu¬ 
ition  that  equally  good  value  estimates  for  each  action  are  not 
required,  to  develop  a  heuristic  method  that  achieves  major 
reductions  in  required  sample  size.  The  heuristic  method  is 
validated  empirically. 

Introduction 

The problem of decision-making involves the selection of
an optimal strategy. A strategy determines how we should
act  based  on  observations  or  available  information  about  the 
variables  of  the  system  relevant  to  the  decision  problem. 
Posed  in  the  framework  of  decision  theory,  an  optimal  strat¬ 
egy  is  one  that  maximizes  our  utility.  The  utility  defines  our 
notion  of  value  associated  with  the  execution  of  actions  and 
the  states  of  the  system.  The  states  result  from  the  combina¬ 
tion  of  the  state  of  the  individual  variables  in  the  system.  In 
the  case  of  decision-making  under  uncertainty,  we  are  un¬ 
certain  about  both  the  state  of  the  system  and  the  result  of 
the  actions  we  take.  We  express  this  uncertainty  as  proba¬ 
bilities.  Therefore,  in  this  context  an  optimal  strategy  is  one 
that  maximizes  our  expected  utility. 

In  this  paper  our  main  interest  is  in  decision  problems  un¬ 
der  uncertainty  formulated  as  influence  diagrams  (ID).  An 
influence  diagram  is  a  graphical  model  that  provides  a  com¬ 
pact  representation  of  (1)  the  probability  distribution  gov¬ 
erning  the  states,  (2)  the  structural  strategy  model  represent¬ 
ing  how  we  make  decisions,  and  (3)  a  utility  model  defining 
our  notion  of  value  associated  with  actions  and  states.  We 
study  the  problem  of  selecting  an  optimal  strategy  in  an  in¬ 
fluence  diagram,  concentrating  on  the  case  in  which  there  is 
only one decision to be made. This is because we can decompose the
problem of multiple decisions into many subproblems involving single
decisions (i.e., by using the technique presented by Charnes & Shenoy
(1999)). We note that we can apply methods developed to solve IDs of
this kind to obtain methods to solve finite-horizon Markov decision
processes (MDPs) and partially observable Markov decision processes
(POMDPs) expressed as dynamic Bayesian networks (DBNs) (i.e., by
modifying the technique presented by Kearns, Mansour, & Ng (1999)).

Copyright © 2000, American Association for Artificial Intelligence
(www.aaai.org). All rights reserved.


Leslie Pack Kaelbling

Artificial Intelligence Laboratory
Massachusetts Institute of Technology
545 Technology Square
Cambridge, MA 02139 USA
lpk@ai.mit.edu

The problem of strategy selection involves the subproblem of selecting
an optimal action from the set of action choices available for that
decision, for each possible observation available at the time of
making the decision. Therefore, we want to select the action that
maximizes the expected utility for each observation. One way to do action
selection  is  to  compute,  exactly  or  approximately,  the  prob¬ 
abilities  of  the  sub-states  of  the  system  directly  relevant  to 
our  utility  in  order  to  evaluate  the  expected  utility  or  value 
of  each  action.  A  sub-state  is  formed  from  the  state  of  a  sub¬ 
set  of  variables  in  the  system.  We  believe  this  approach  fails 
to  take  advantage  of  an  important  intuition:  it  only  matters 
which  action  is  best.  Therefore,  the  problem  of  action  selec¬ 
tion  is  primarily  one  of  comparing  the  values  of  the  actions. 
We  combine  this  with  the  intuition  that  actions  that  are  close 
to  optimal  are  also  good.  In  this  paper,  we  present  meth¬ 
ods  for  action  selection  in  IDs  that  take  advantage  of  these 
intuitions  to  make  major  gains  in  efficiency. 

Notation 

Before we present the definition of the ID model, we introduce some
notation used throughout the paper. We denote one-dimensional random
variables by capital letters and denote multi-dimensional random
variables by bold capital letters. For instance, we denote a
multi-dimensional random variable by $X$ and denote all its components
by $(X_1, \ldots, X_n)$, where $X_i$ is the $i$th one-dimensional random
variable. We use small letters to denote assignments to random
variables. For instance, $X = x$ means that for each component $X_i$ of
$X$, $X_i = x_i$. We also denote by capital letters the nodes in a
graph. We denote by $Pa(Y)$ the parents of node $Y$ in a directed graph.

We now introduce notation that will become useful during the
description of the methods presented in this paper. For any function
$h$ with variables $X$ and $Z$, the expression

$$h(X, Z)\big|_{Z=z}$$

stands for a function $f$ over variables $X$ that results from setting
the values of $Z$ in $h$ with assignment $z$ while letting the values
for $X$ remain unassigned. In other words,

$$f(X) = h(X, Z)\big|_{Z=z} = h(X, Z = z).$$
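
In programming terms this restriction is partial application. A minimal sketch, using a made-up function `h` purely for illustration:

```python
from functools import partial


def h(x, z):
    """Hypothetical function of two variables, standing in for h(X, Z)."""
    return 10 * x + z


# f(X) = h(X, Z)|_{Z=3}: fix Z to the assignment 3, leave X unassigned.
f = partial(h, z=3)
```

Calling `f(2)` then evaluates `h(2, 3)`.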

The notation $Z = (S, S')$ means that the variable $Z$ is formed by all
the variables that form $S$ and $S'$. That is,

$$Z = (Z_1, \ldots, Z_{n'}) = (S_1, \ldots, S_{n_1}, S'_1, \ldots, S'_{n_2}) = (S, S'),$$

where $n' = n_1 + n_2$. Note that we are assuming that the set of
variables forming $S$ and those forming $S'$ are disjoint. The notation
$Z \sim f$ means that the random variable $Z$ is distributed according
to probability distribution $f$. We denote a sequence of samples from
$Z$ by $z^{(1)}, z^{(2)}, \ldots$, where $z^{(i)}$ is the $i$th sample.
In this paper, we assume that the samples are independent.


Definitions 

An influence diagram (ID) is a graphical model for decision-making
(see Jensen (1996) for additional information and
references).  It  consists  of  a  directed  acyclic  graph  along  with 
a  structural  strategy  model,  a  probabilistic  model  and  a  util¬ 
ity  model.  The  graph  represents  the  decomposition  used  to 
compactly  define  the  different  models.  Figure  1  shows  an 
example  of  a  general  graphical  representation  of  an  ID.  The 
vertices  of  the  graph  consist  of  three  types  of  nodes:  decision 
nodes,  chance  nodes  and  utility  nodes.  Decision  nodes  are 
square  and  represent  the  decisions  or  action  choices  in  the 
decision  problem.  Chance  nodes  are  circular  and  represent 
the  variables  of  the  system  relevant  to  the  decision  problem. 
Utility  nodes  are  diamonds  and  represent  the  utility  associ¬ 
ated  with  actions  and  states.  A  state  is  an  assignment  to  the 
variables  associated  with  the  chance  nodes  of  the  ID. 


Structural strategy model The structural strategy model defines
locally the form of a decision rule for each decision node $A_i$. This
rule is a function of (a subset of) the information available at the
time of making that decision, which is contained in its parents
$Pa(A_i)$ in the graph, the decision nodes that are predecessors of
decision node $A_i$ in the graph and their respective parents. The
example ID of Figure 1 has only one decision node. Denote a strategy
for our example model by $\pi$, the state space or set of possible
assignments for the parents of the action node by $\Omega_{Pa(A)}$, and
the set of possible actions by $\Omega_A$. Then, a policy is a mapping
$\pi : \Omega_{Pa(A)} \to \Omega_A$.


Probability model The probability model compactly defines the joint
probability distribution of the relevant variables given the actions
taken using a Bayesian network (BN) (see Jensen (1996) for additional
information and references). The model defines locally a conditional
probability distribution $P(X_i \mid Pa(X_i))$ for each variable $X_i$
given its parents $Pa(X_i)$ in the graph. This defines the following
joint probability distribution over the $n$ variables of the system,
given that a particular action $a$ is taken:

$$P(X_1, \ldots, X_n \mid A = a) = \prod_{i=1}^{n} P(X_i \mid Pa(X_i))\big|_{A=a}.$$


Figure  1:  General  structure  of  ID  we  consider. 


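In our example ID, $X = (S, S', O)$ and, since there is only one
decision node, we can express $P(X \mid A = a)$ as

$$P(X \mid A = a) = P(S, S', O \mid A = a) = P(S)\, P(S' \mid S, O, A = a)\, P(O \mid S),$$

where

$$P(S) = \prod_{i=1}^{n_1} P(S_i \mid Pa(S_i)), \qquad (1)$$

$$P(S' \mid S, O, A = a) = \prod_{i=1}^{n_2} P(S'_i \mid Pa(S'_i))\big|_{A=a}, \qquad (2)$$

$$P(O \mid S) = \prod_{i=1}^{n_3} P(O_i \mid Pa(O_i)). \qquad (3)$$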
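
As a concrete illustration of this factorization, the sketch below builds the joint distribution for a tiny hypothetical ID with one binary variable each for $S$, $S'$ and $O$ and two actions. All probability values and function names are invented for the example; they are not taken from the paper.

```python
from itertools import product

# Hypothetical local models for a toy ID (binary S, S', O; actions 0 and 1).
P_S = {0: 0.7, 1: 0.3}                                     # P(S = s)
P_O_GIVEN_S = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}   # P(O = o | S = s)


def p_s_next(s_next, s, o, a):
    """P(S' = s_next | S = s, O = o, A = a); o is unused in this toy model."""
    stay = 0.8 if a == 0 else 0.4
    return stay if s_next == s else 1.0 - stay


def joint(s, s_next, o, a):
    """P(S, S', O | A = a) = P(S) P(S' | S, O, A = a) P(O | S)."""
    return P_S[s] * p_s_next(s_next, s, o, a) * P_O_GIVEN_S[s][o]


# For every action the joint must sum to one over all states.
totals = {a: sum(joint(s, s2, o, a) for s, s2, o in product((0, 1), repeat=3))
          for a in (0, 1)}
```

Each local table is small even when the full joint is not, which is what makes the representation compact.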

Utility  model  Finally,  the  utility  model  defines  the  utility 
associated  with  actions  resulting  from  the  decisions  made 
and  states  of  the  variables  in  the  system.  The  total  utility 
function  U  is  the  sum  of  local  utility  functions  associated 
with each utility node. For each utility node $U_i$, the utility
function provides a utility value as a function of its parents
$Pa(U_i)$ in the graph. The total utility can be expressed as

$$U(X, A) = \sum_{i} U_i(Pa(U_i)). \qquad (4)$$

Note  that  we  are  using  the  label  of  the  utility  node  to  also 
denote  the  utility  function  associated  with  it. 

In this paper we assume that the variables and the decisions are
discrete and the local utilities are bounded. In addition, we
concentrate on IDs with one decision node and the general structure
shown in Figure 1. The results in this paper are still valid for more
general structural decompositions of the probability distribution. We
use the structure given by the ID in the figure to simplify the
presentation. Also, the results allow random utility functions.


Value of a strategy The value $V^{\pi}$ of a strategy $\pi$ is the
expected utility of the strategy:

$$V^{\pi} = \sum_{X} P(X \mid A = \pi(O))\, U(X, A = \pi(O)) = \sum_{O} \sum_{S} \sum_{S'} P(S, S', O \mid A = \pi(O))\, U(S, S', O \mid A = \pi(O)).$$

The optimal strategy $\pi^*$ is that which maximizes $V^{\pi}$ over
all $\pi$. We denote the value of the optimal strategy by $V^*$.

Note that we can decompose this maximization into maximizations over
the set of actions for each observation. For each assignment to the
observations $o$, we define the value of an action $a$ by

$$V_o(a) = \sum_{S} \sum_{S'} P(S, S', O = o \mid A = a)\, U(S, S', O = o \mid A = a). \qquad (5)$$

Hence, the value of a strategy is $V^{\pi} = \sum_{O} V_O(\pi(O))$.
Note that this is not the traditional definition of the value of an
action. We discuss below why we do not use the traditional definition.

If we denote by $a^* = \pi^*(o)$ the action that maximizes $V_o(a)$
over all actions $a$, then the value of the optimal strategy is
$V^* = \sum_{O} V_O(\pi^*(O)) = \sum_{O} \max_a V_O(a)$. Hence, the
problem of strategy selection reduces to that of action selection for
each observation.
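
The decomposition can be checked by brute force on a small model. The sketch below uses a hypothetical toy ID (binary $S$, $S'$, $O$, two actions, invented probabilities and utilities) and computes each $V_o(a)$ by enumeration; it is only practical for tiny state spaces and is not one of the methods proposed in the paper.

```python
from itertools import product

# Hypothetical toy ID; every number below is made up for illustration.
P_S = {0: 0.7, 1: 0.3}
P_O_GIVEN_S = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
ACTIONS = (0, 1)
OBSERVATIONS = (0, 1)


def p_s_next(s_next, s, o, a):
    stay = 0.8 if a == 0 else 0.4
    return stay if s_next == s else 1.0 - stay


def utility(s, s_next, o, a):
    return 1.0 if s_next == a else 0.0


def value(o, a):
    """V_o(a): sum over S and S' of P(S, S', O = o | A = a) times utility."""
    return sum(P_S[s] * p_s_next(s2, s, o, a) * P_O_GIVEN_S[s][o]
               * utility(s, s2, o, a)
               for s, s2 in product((0, 1), repeat=2))


def optimal_strategy():
    """Pick the best action separately for each observation."""
    return {o: max(ACTIONS, key=lambda a: value(o, a)) for o in OBSERVATIONS}


def strategy_value(pi):
    """V^pi as the sum over observations of V_o(pi(o))."""
    return sum(value(o, pi[o]) for o in OBSERVATIONS)
```

Selecting the maximizing action independently for each observation gives a strategy whose value is at least that of any other mapping from observations to actions.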

Exact methods exist for computing the optimal strategy in an ID (see
Charnes & Shenoy (1999) and Jensen (1996) for short descriptions and a
list of references). However, this problem is hard in general. In this
paper, we concentrate on obtaining approximations to the optimal
strategy with certain guarantees. Our objective is to find policies
that are close to optimal with high probability. That is, for a given
accuracy parameter $\epsilon^*$ and confidence parameter $\delta^*$, we
want to obtain a strategy $\pi$ such that $V^* - V^{\pi} < \epsilon^*$
with probability at least $1 - \delta^*$. Note that given the
decomposition described above, if we obtain actions for each
observation such that their value is sufficiently close to optimal with
sufficiently high probability, then we obtain a near-optimal strategy
with high probability. That is, let $l$ be the number of possible
assignments to the observations. If for each observation $o$ we select
action $\hat{a}$ such that $V_o(a^*) - V_o(\hat{a}) < 2\epsilon$ with
probability at least $1 - \delta$, where $\epsilon = \epsilon^*/(2l)$
and $\delta = \delta^*/l$, then we obtain a strategy that is within
$\epsilon^*$ of the optimal with probability at least $1 - \delta^*$.
Therefore, we concentrate on finding a good action for each
observation.
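
The arithmetic behind this split is simple: each of the observations may lose up to twice the per-observation accuracy, and the per-observation failure probabilities add up under a union bound. A small sketch (the function name is ours, not the paper's):

```python
def per_observation_parameters(eps_star, delta_star, num_observations):
    """Split the overall accuracy and confidence across the observations.

    Returns (eps, delta) with 2 * eps * num_observations == eps_star and
    delta * num_observations == delta_star, up to floating-point rounding.
    """
    eps = eps_star / (2 * num_observations)
    delta = delta_star / num_observations
    return eps, delta
```

For example, with an overall accuracy of 0.1, an overall failure probability of 0.05 and five possible observations, each per-observation comparison has to be accurate to within twice 0.01 with failure probability 0.01.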

Typically the value of an action is defined as the conditional
expected utility of the action given an assignment of the
observations. If we denote this value by $V(a \mid o)$, we can express
the value of a policy as $V^{\pi} = \sum_{O} P(O)\, V(\pi(O) \mid O)$.
We do not use this definition because it is harder to obtain estimates
for $V(a \mid o)$ with guaranteed confidence bounds than it is to
obtain estimates for $V_o(a)$.

Multiple  Comparisons  with  the  Best:  Results 

There  are  two  important  results  from  the  field  of  multi¬ 
ple  comparisons  and  in  particular  from  the  field  of  multi¬ 
ple  comparisons  with  the  best  that  we  take  advantage  of 
in  this  paper.  These  results  are  based  on  the  work  of  Hsu 


(1981) (see Hsu (1996) for more information). Before we present the
results we introduce the following notation: denote
$x^{+} = \max(x, 0)$ and $-x^{-} = \min(0, x)$. The first result is
known as Hsu's single-bound lemma, which is presented as Lemma 1 by
Matejcik & Nelson (1995).

Lemma 1 Let $\mu_{(1)} \leq \mu_{(2)} \leq \cdots \leq \mu_{(k)}$ be the
(unknown) ordered performance parameters of $k$ systems, and let
$\hat{\mu}_{(1)}, \hat{\mu}_{(2)}, \ldots, \hat{\mu}_{(k)}$ be any
estimators of the parameters. If

$$\Pr\{\hat{\mu}_{(k)} - \hat{\mu}_{(i)} - (\mu_{(k)} - \mu_{(i)}) > -w,\ i = 1, \ldots, k-1\} = 1 - \alpha, \qquad (6)$$

then

$$\Pr\{\mu_i - \max_{j \neq i} \mu_j \in [-(\hat{\mu}_i - \max_{j \neq i} \hat{\mu}_j - w)^{-},\ (\hat{\mu}_i - \max_{j \neq i} \hat{\mu}_j + w)^{+}],\ \text{for all } i\} \geq 1 - \alpha. \qquad (7)$$

If we replace the $=$ in (6) with $\geq$, then (7) still holds.

In our context, we let for each action $a$, the true value
$\mu_a = V_o(a)$ and the estimate $\hat{\mu}_a = \hat{V}_o(a)$. Also,
the smallest true value corresponds to $\mu_{(1)}$. That is, if
$V_o(a_1) \leq V_o(a_2) \leq \cdots \leq V_o(a_k)$, then for all $i$,
$\mu_{(i)} = V_o(a_i)$. Note that in practice, we do not know which
action has the largest value. In order to apply Hsu's single-bound
lemma, we obtain the bound
$\Pr\{\hat{\mu}_j - \hat{\mu}_i - (\mu_j - \mu_i) > -w,\ \text{for all } i \neq j\} \geq 1 - \alpha$,
for each action $j$, individually. This implies that
$\Pr\{\hat{\mu}_{(k)} - \hat{\mu}_{(i)} - (\mu_{(k)} - \mu_{(i)}) > -w,\ i = 1, \ldots, k-1\} \geq 1 - \alpha$,
which allows us to apply the lemma. Figure 2 graphically describes
this practical interpretation of the lemma. For each action $i$,
individually, the upper bounds on the true differences, drawn on the
left-hand side,
$V_o(i) - V_o(j) < \hat{V}_o(i) - \hat{V}_o(j) + w$, for each
$j \neq i$, hold simultaneously with probability at least $1 - \alpha$.
The confidence intervals, drawn on the right-hand side,

$$V_o(i) - \max_{j \neq i} V_o(j) \in [-(\hat{V}_o(i) - \max_{j \neq i} \hat{V}_o(j) - w)^{-},\ (\hat{V}_o(i) - \max_{j \neq i} \hat{V}_o(j) + w)^{+}],$$

for each action $i$, hold simultaneously with probability at least
$1 - \alpha$.
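
Given the value estimates and a half-width $w$, the intervals of the single-bound lemma are straightforward to compute. The sketch below is illustrative only: the function name is ours, and it assumes the convention that the lower endpoint is the minimum of zero and the shifted difference, and the upper endpoint is the maximum of zero and the shifted difference.

```python
def single_bound_intervals(estimates, w):
    """Intervals for V(i) - max_{j != i} V(j), one per action.

    estimates: list of value estimates, one per action (at least two).
    w:         common half-width from the simultaneous bound.
    """
    intervals = []
    for i, est in enumerate(estimates):
        best_other = max(e for j, e in enumerate(estimates) if j != i)
        diff = est - best_other
        lower = min(0.0, diff - w)  # never above zero
        upper = max(0.0, diff + w)  # never below zero
        intervals.append((lower, upper))
    return intervals
```

An action whose upper endpoint is zero cannot be the best at the stated confidence level, while an action whose lower endpoint is zero is the best.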

The  second  result  allows  us  to  assess  joint  confidence  in¬ 
tervals  on  the  difference  between  the  value  of  each  action 
from  the  value  of  the  best  action  when  we  have  estimates  of 
the  differences  between  value  of  each  pair  of  actions  with 
different  degrees  of  accuracy.  The  result  is  known  as  Hsu's 
multiple-bound  lemma.  It  is  presented  as  Lemma  2  by  Mate¬ 
jcik  &  Nelson  (1995),  and  credited  to  Chang  &  Hsu  (1992). 
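Lemma 2 Let $\mu_{(1)} \leq \mu_{(2)} \leq \cdots \leq \mu_{(k)}$ be the
(unknown) ordered performance parameters of $k$ systems. Let $T_{ij}$
be a point estimator of the parameter $\mu_i - \mu_j$. If for each $i$
individually

$$\Pr\{T_{ij} - (\mu_i - \mu_j) > -w_{ij},\ \text{for all } j \neq i\} = 1 - \alpha, \qquad (8)$$

then we can make the joint probability statement

$$\Pr\{\mu_i - \max_{j \neq i} \mu_j \in [D_i^{-}, D_i^{+}],\ \text{for all } i\} \geq 1 - \alpha, \qquad (9)$$

where $D_i^{+} = (\min_{j \neq i} [T_{ij} + w_{ij}])^{+}$,
$\mathcal{G} = \{l : D_l^{+} > 0\}$, and

$$D_i^{-} = \begin{cases} 0 & \text{if } \mathcal{G} = \{i\}, \\ \min_{j \in \mathcal{G},\, j \neq i} [-T_{ji} - w_{ji}] & \text{otherwise.} \end{cases}$$

If we replace the $=$ in (8) with $\geq$, then (9) still holds.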
Figure 2: Graphical description for practical application of
Hsu's single-bound lemma. Note that the "lower bounds" on
the left-hand side are $-\infty$.


Figure 3: Graphical description of Hsu's multiple-bound
lemma. Note that the "lower bounds" on the left-hand side
are $-\infty$.


Figure 3 presents a graphical description of this lemma. Let, for all
actions $i$, $D_i^{-}$, $D_i^{+}$ and $\mathcal{G}$ be as defined in
Hsu's multiple-bound lemma, with $\mu_i = V_o(i)$ and for all
$j \neq i$, $T_{ij} = \hat{V}_o(i) - \hat{V}_o(j)$. For each action $i$,
individually, the upper bounds on the true differences, drawn on the
left-hand side, $V_o(i) - V_o(j) < T_{ij} + w_{ij}$, for each
$j \neq i$, hold simultaneously with probability at least $1 - \alpha$.
The confidence intervals, drawn on the right-hand side,
$V_o(i) - \max_{j \neq i} V_o(j) \in [D_i^{-}, D_i^{+}]$, for each
action $i$, hold simultaneously with probability at least $1 - \alpha$.
Also, in this example, $\mathcal{G} = \{1, 2\}$. In our context,
$\mathcal{G}$ is the set of all the actions that could potentially be
the best with probability at least $1 - \alpha$. That is, for each
action $a$ in $\mathcal{G}$, the upper bound $D_a^{+}$ on the difference
of the true value of action $a$ and the best of all the other actions,
including those in $\mathcal{G}$, is positive.
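
A sketch of how the quantities in the multiple-bound lemma could be computed from pairwise estimates follows. It is an illustration under assumptions: the function name is ours, the lower endpoint for an action is taken as the smallest value of minus the reverse pairwise estimate minus its half-width over the other candidate actions (which is what the one-sided bounds imply), and the candidate set is assumed to be non-empty.

```python
def multiple_bound_intervals(T, W):
    """Return (d_minus, d_plus, candidates) from pairwise estimates.

    T[i][j]: point estimate of the difference between actions i and j.
    W[i][j]: half-width attached to that estimate.
    Diagonal entries are ignored.
    """
    k = len(T)
    d_plus = [max(0.0, min(T[i][j] + W[i][j] for j in range(k) if j != i))
              for i in range(k)]
    # Actions whose upper bound is positive may still be the best.
    candidates = [i for i in range(k) if d_plus[i] > 0]
    d_minus = []
    for i in range(k):
        others = [j for j in candidates if j != i]
        if not others:
            d_minus.append(0.0)  # i is the only candidate
        else:
            d_minus.append(min(-T[j][i] - W[j][i] for j in others))
    return d_minus, d_plus, candidates
```

With estimates 1.0, 2.0 and 2.2 and a common half-width of 0.5, only the last two actions remain candidates, mirroring the situation in which two of three actions could still be the best.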

Estimation-based  methods 

One approach to selecting the best action is to obtain estimates of
$V_o(a)$ for each $a$ by sampling, using the probability model of the
ID conditioned on $a$, then select the action with the largest
estimated value.

We  can  apply  the  idea  of  importance  sampling  (See 
Geweke  (1989)  and  the  references  therein)  to  this  estima¬ 
tion  problem  by  using  the  probability  distribution  defined  by 
the  ID  as  the  importance  function  or  sampling  distribution. 
This  is  essentially  the  same  idea  as  likelihood-weighting  in 
the  context  of  probabilistic  inference  in  Bayesian  networks 
(Shachter  &  Peot,  1989;  Fung  &  Chang,  1989).  We  present 
this  method  in  the  context  of  our  example  ID. 

First, we present definitions that will allow us to rewrite $V_o(a)$
more clearly. Let $Z = (S, S')$. Define the target function (in our
case, the weighted utilities)

$$g_{a,o}(Z) = g_{a,o}(S, S') = P(S)\, P(S' \mid S, O = o, A = a)\, P(O = o \mid S)\, U(S, S', O = o, A = a).$$

Note that $V_o(a) = \sum_{Z} g_{a,o}(Z)$. Define the importance
function as

$$f_{a,o}(Z) = P(S)\, P(S' \mid S, O = o, A = a). \qquad (10)$$

Define the weight function
$\omega_{a,o}(Z) = g_{a,o}(Z) / f_{a,o}(Z)$. Note that in this case,

$$\omega_{a,o}(Z) = P(O = o \mid S)\, U(S, S', O = o, A = a). \qquad (11)$$

Finally, note that
$V_o(a) = \sum_{Z} f_{a,o}(Z)\, (g_{a,o}(Z) / f_{a,o}(Z))$. The idea of
the sampling methods described in this section is to obtain
independent samples according to $f_{a,o}$, use those samples to
estimate the value of the actions, and finally select an approximately
optimal action by taking the action with largest value estimate.
Denote the weight of a sample $z^{(i)}$ from $Z \sim f_{a,o}$ as
$\omega_{a,o}^{(i)} = \omega_{a,o}(z^{(i)})$. Then an unbiased estimate
of $V_o(a)$ is
$\hat{V}_o(a) = \frac{1}{N_{a,o}} \sum_{i=1}^{N_{a,o}} \omega_{a,o}^{(i)}$.

Traditional Method

We can obtain an estimate of $V_o(a)$ using the straightforward method
presented in Algorithm 1; it requires parameters $N_{a,o}$ that will be
defined in Theorem 1.

This  is  the  traditional  sampling-based  method  used  for 
action  selection.  However,  we  are  unaware  of  any  result 
regarding  the  number  of  samples  needed  to  obtain  a  near- 
optimal  strategy  with  high  probability  using  this  method. 


Algorithm 1 Traditional Method

1. Obtain independent samples $z^{(1)}, \ldots, z^{(N_{a,o})}$ from $Z \sim f_{a,o}$.

2. Compute the weights $w_{a,o}^{(1)}, \ldots, w_{a,o}^{(N_{a,o})}$.

3. Output $\hat{V}_o(a)$ = average of the weights.


Theorem 1 If for each possible action $i = 1, \ldots, k$ we estimate $V_o(i)$ using the traditional method, the weight function satisfies $l_{i,o} \le w_{i,o}(Z) \le u_{i,o}$, and the estimate uses


samples, then the action with the largest value estimate has a true value that is within $2\epsilon$ of the optimal with probability at least $1 - \delta$.

Proof sketch. The proof goes in three basic steps. First, we apply Hoeffding bounds (Hoeffding, 1963) to obtain a bound on the probability that each estimate deviates from its true mean by some amount $\epsilon$. Then, we apply the Bonferroni inequality (union bound) to obtain joint bounds on the probability that the difference of each estimate from all the others deviates from the true difference by $2\epsilon$. Finally, we apply Hsu's single-bound lemma to obtain our result.
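The first two steps of the proof sketch can be made concrete with a short calculation. The sketch below is the textbook two-sided Hoeffding bound combined with a union bound over $k$ actions; the resulting constant is an assumption of this example and is not claimed to be the exact expression used in Theorem 1. The function name and arguments are hypothetical.

```python
import math

def hoeffding_sample_size(lower, upper, eps, delta, k):
    """Number of samples so that each of k bounded estimates is within eps of
    its mean, jointly with probability at least 1 - delta.

    Uses P(|mean_N - mu| > eps) <= 2 exp(-2 N eps^2 / (upper - lower)^2)
    and allots failure probability delta / k to each action (union bound).
    """
    width = upper - lower
    return math.ceil(width ** 2 / (2 * eps ** 2) * math.log(2 * k / delta))
```

For example, weights in $[0, 1]$ with `eps=0.1`, `delta=0.05` and `k=3` give 240 samples per action. The quadratic dependence on the range `upper - lower` is why loose bounds on the weights translate directly into large sample counts.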

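Note that we can compute $l_{i,o}$ and $u_{i,o}$ efficiently from information local to each node in the graph. Assuming that we have non-negative utilities, we can let

$$u_{i,o} = \Big[\prod_{j} \max_{Pa(O_j)} P(O_j \mid Pa(O_j))\big|_{O=o}\Big] \cdot \Big[\sum_{j} \max_{Pa(U_j)} U_j(Pa(U_j))\big|_{O=o, A=i}\Big], \qquad (12)$$

$$l_{i,o} = \Big[\prod_{j} \min_{Pa(O_j)} P(O_j \mid Pa(O_j))\big|_{O=o}\Big] \cdot \Big[\sum_{j} \min_{Pa(U_j)} U_j(Pa(U_j))\big|_{O=o, A=i}\Big]. \qquad (13)$$

However, these bounds can be very loose.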


Sequential  Method 

The  sequential  method  tries  to  reduce  the  number  of  samples 
needed  by  the  traditional  method,  using  ideas  from  sequen¬ 
tial  analysis.  The  idea  is  to  first  obtain  an  estimate  of  the 
variance  and  then  use  it  to  compute  the  number  of  samples 
needed to estimate the mean. The method, presented in Algorithm 2,
requires parameters $N'_{a,o}$ and $N''_{a,o}$ that will be
defined in Theorem 2.

Note  that  given  the  sequential  nature  of  the  method,  the 
total  number  of  samples  is  now  a  random  variable.  We  also 
note  that  while  multi-stage  procedures  of  this  kind  are  com¬ 
monly  used  in  the  statistical  literature,  we  are  only  aware 
of  results  based  on  restricting  assumptions  on  the  distribu¬ 
tion  of  the  random  variables  (i.e.,  parametric  families  like 
normal  and  binomial  distributions)  (Bechhofer,  Santner,  & 
Goldsman,  1995). 

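Algorithm 2 Sequential Method

1. Obtain independent samples $z^{(1)}, \ldots, z^{(2N'_{a,o})}$ from $Z \sim f_{a,o}$.

2. Compute the weights $w_{a,o}^{(1)}, \ldots, w_{a,o}^{(2N'_{a,o})}$.

3. For $j = 1, \ldots, N'_{a,o}$, let $y_j = (w_{a,o}^{(2j-1)} - w_{a,o}^{(2j)})^2 / 2$.

4. Compute $\hat{\sigma}^2_{a,o}$ = average of the $y_j$'s.

5. Let $N_{a,o} = 2N'_{a,o} + N''_{a,o}(\hat{\sigma}^2_{a,o})$.

6. Obtain $N''_{a,o}(\hat{\sigma}^2_{a,o})$ new independent samples $z^{(2N'_{a,o}+1)}, \ldots, z^{(N_{a,o})}$ from $Z \sim f_{a,o}$.

7. Compute the new weights $w_{a,o}^{(2N'_{a,o}+1)}, \ldots, w_{a,o}^{(N_{a,o})}$.

8. Output $\hat{V}_o(a)$ = average of the new weights.


Theorem 2 If for each possible action $i = 1, \ldots, k$, we estimate $V_o(i)$ using the sequential method, the weight function satisfies $l_{i,o} \le w_{i,o}(Z) \le u_{i,o}$, $\sigma^2_{i,o} = Var[w_{i,o}(Z)]$,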


N!  = 


2fcl 

2  22/3g4/3  S  ’ 


NU^lo)  = 


2d-?o  +  2(ui,o  -  li,o)e/3 


€4/3  S 


then the action with the largest value estimate has a true
value that is within $2\epsilon$ of the optimal with probability at least
$1 - \delta$. Also,


Ni,o  < 


I 

^ - ^473 - jlny  +  1 


O  max 


<^i,o  {Ui,o -koY^^ 


with probability at least $1 - \delta/(2k)$, and

$$E[N_{i,o}] = 2N'_{i,o} + N''_{i,o}(\sigma^2_{i,o})$$

-  O  fmax  -  ^ 


Proof sketch. The only difference from the proof of Theorem 1 is the first step. Instead of using Hoeffding bounds to bound the probability that each estimate deviates from its true mean, we use a combination of Bernstein's inequality (as presented by Devroye, Gyorfi, & Lugosi (1996) and credited to Bernstein (1946)) and Hoeffding bounds as follows. We first use the Hoeffding bound to bound the probability that the estimate of the variance after taking some number of samples $2N'$ deviates from the true variance by some amount $\epsilon'$. We then use Bernstein's inequality to bound the probability that the estimate we obtain after taking some number of samples $N''$ deviates from its true mean by $\epsilon$ given that the true variance is no larger than our estimate of the variance plus $\epsilon'$. We then find the value of $\epsilon'$ (in terms of $\epsilon$) that minimizes the total number of samples $N'' + 2N'$. The results on the number of samples follow by substituting the minimizing $\epsilon'$ back into the expressions for $N''$ and $N'$. Steps 2 and 3 are as in Theorem 1.
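The two ingredients named in the proof sketch can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the pairwise variance estimate is the standard unbiased estimator built from consecutive pairs of samples, and the sample-size function uses the usual two-sided form of Bernstein's inequality for variables whose deviation from the mean is bounded by `width`. The function names are hypothetical, and the constants are not claimed to match those of Theorem 2.

```python
import math

def paired_variance_estimate(weights):
    """Unbiased variance estimate from consecutive pairs of samples:
    the average of (w1 - w2)^2 / 2 over disjoint pairs."""
    ys = [(a - b) ** 2 / 2 for a, b in zip(weights[0::2], weights[1::2])]
    return sum(ys) / len(ys)

def bernstein_sample_size(var_bound, width, eps, delta):
    """Samples needed so that the sample mean deviates from the true mean by
    more than eps with probability at most delta, assuming the variance is
    at most var_bound and |X - E[X]| <= width.

    From P(|mean_N - mu| > eps) <= 2 exp(-N eps^2 / (2 var + 2 width eps / 3)).
    """
    return math.ceil((2 * var_bound + 2 * width * eps / 3) / eps ** 2
                     * math.log(2 / delta))
```

With `var_bound=1.0`, `width=10`, `eps=1.0` and `delta=0.05`, the Bernstein count is 32, whereas a Hoeffding bound that uses only the range would need roughly 185 samples. The gain comes entirely from the variance being small relative to the squared range.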

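The sequential method is particularly more effective than
the traditional method when the variance $\sigma^2_{i,o}$ is much smaller than the square of the range, $(u_{i,o} - l_{i,o})^2$.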

Comparison-based  Method 

Using  the  results  from  MCB,  we  can  compute  simultaneous 
or  joint  confidence  intervals  on  the  difference  between  the 
value of $V_o(a)$ and the best of all the others for all actions $a$.
Therefore,  MCB  allows  us  to  select  the  best  action  choice  or 
an  action  with  value  close  to  it,  within  a  confidence  level. 

In  the  previous  section  we  presented  methods  that  require 
that  we  have  estimates  with  the  same  precision  in  order  to 
select  a  good  action.  Hsu’s  multiple-bound  lemma  applies 
when we do not have estimates of $V_o(a)$ for each $a$ with the
same  precision.  Based  on  this  result,  we  propose  the  method 
presented  in  Algorithm  3  for  action  selection. 


Algorithm 3 Comparison-based Method

1. Obtain an initial number of samples for each action a.

2. Compute MCB confidence intervals on the difference in value of each action from the best of the other actions using those samples.

3. while not able to select a good action with high certainty do

   (a) Obtain additional samples.

   (b) Recompute MCB confidence intervals using the total samples so far.


We compute the MCB confidence intervals heuristically. To do this, we approximate the precisions that satisfy the conditions required by Hsu's multiple-bound lemma (Equation 8) using Hoeffding bounds (Hoeffding, 1963). Using this approach, for each pair of actions $i$ and $j$, and values $l_{ij,o}$ and $u_{ij,o}$ such that $l_{ij,o} \le w_{i,o}(Z) \le u_{ij,o}$ and $l_{ij,o} \le w_{j,o}(Z) \le u_{ij,o}$, we approximate $w_{ij}$ as


where $N_{i,o}$ is the number of samples taken for action $i$ thus
far.  We  then  use  these  approximate  precisions  and  the  value- 
difference  estimates  to  compute  the  MCB  confidence  inter¬ 
vals  (as  specified  by  Equation  9).  There  are  alternative  ways 
of  heuristically  approximating  the  precisions  but,  in  this  pa¬ 
per,  we  use  the  one  above  for  simplicity. 

Once  we  compute  the  intervals,  the  stopping  condition  is 
as  follows.  If  at  least  one  of  the  lower  bounds  of  the  MCB 
confidence intervals is greater than $-2\epsilon$, then we stop and
select  the  action  that  attains  this  lower  bound.  Otherwise, 
we  continue  taking  additional  samples. 

We set the initial number of samples for each action to 40 in our experiments. When taking additional samples, we use a sampling schedule that is somewhat selective in that it takes more samples from more promising actions as suggested by the MCB confidence intervals. We find the action whose corresponding MCB confidence interval has an upper bound greater than 0 (i.e., from the set $G$ as defined in Hsu's multiple-bound lemma) and whose lower bound is the largest. We take 40 additional samples from this action and 10 from all the others. We recognize that these sample sizes are arbitrary. Other settings of these sample sizes may be more effective, but we did not try to optimize them for our experiments. Algorithm 4 presents a detailed description of the instance of the method we used in the experiments.
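The stopping condition and the selective sampling schedule described above can be sketched in Python. The sketch takes the lower and upper bounds of the MCB confidence intervals as given inputs (computing them requires Hsu's multiple-bound lemma, which is not reproduced here); the function names and the fallback for an empty candidate set are assumptions of this example.

```python
def should_stop(lower_bounds, eps):
    """Stop when at least one MCB lower bound exceeds -2 * eps."""
    return any(lb > -2 * eps for lb in lower_bounds)

def next_sample_sizes(lower_bounds, upper_bounds, focus=40, others=10):
    """Per-action sample counts for the next round.

    Among actions whose interval has an upper bound greater than 0 (the set G),
    the one with the largest lower bound receives `focus` samples and every
    other action receives `others` samples.
    """
    candidates = [i for i, ub in enumerate(upper_bounds) if ub > 0]
    if not candidates:  # assumption: fall back to all actions if G is empty
        candidates = list(range(len(upper_bounds)))
    best = max(candidates, key=lambda i: lower_bounds[i])
    return [focus if i == best else others for i in range(len(lower_bounds))]
```

For instance, with lower bounds `[-5.0, -1.0, -3.0]` and upper bounds `[0.0, 2.0, 1.0]`, the second action receives 40 samples and the other two receive 10 each.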


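Algorithm 4 Algorithmic description of the instance of the comparison-based method used in the experiments.

for each observation $o$ do
    $l \leftarrow 1$
    for each action $i = 1, \ldots, k$ do
        Compute $u_{i,o}$ and $l_{i,o}$ using equations 12 and 13, respectively.
        $D_i^- \leftarrow -\infty$; $N_i^{(l)} \leftarrow 40$; $N_{i,o} \leftarrow 0$; $\hat{V}_o(i) \leftarrow 0$.
    for each pair of actions $(i, j)$, $i \ne j$ do
        $u_{ij,o} \leftarrow \max(u_{i,o}, u_{j,o})$; $l_{ij,o} \leftarrow \min(l_{i,o}, l_{j,o})$.
    while there is no action $i$ such that $D_i^- > -2\epsilon$ do
        for each action $i$ do
            Obtain $N_i^{(l)}$ samples $z^{(N_{i,o}+1)}, \ldots, z^{(N_{i,o}+N_i^{(l)})}$ from $Z \sim f_{i,o}$, as in equation 10.
            Compute weights $w_{i,o}^{(N_{i,o}+1)}, \ldots, w_{i,o}^{(N_{i,o}+N_i^{(l)})}$.
            $\hat{V}_o(i) \leftarrow \big(N_{i,o} \hat{V}_o(i) + \sum_{j=N_{i,o}+1}^{N_{i,o}+N_i^{(l)}} w_{i,o}^{(j)}\big) / (N_{i,o} + N_i^{(l)})$.
            $N_{i,o} \leftarrow N_{i,o} + N_i^{(l)}$.
        for each pair of actions $(i, j)$, $i \ne j$ do
            Compute $w_{ij}$ using equation 14; $w_{ji} \leftarrow w_{ij}$.
        for each action $i$ do
            Compute $G$, $D_i^-$ and $D_i^+$ using Hsu's multiple-bound lemma.
        for each action $i$ do
            if $D_i^- = \max_{j \in G} D_j^-$ then $N_i^{(l+1)} \leftarrow 40$
            else $N_i^{(l+1)} \leftarrow 10$.
        $l \leftarrow l + 1$
    $\pi(o) \leftarrow \arg\max_i D_i^-$.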


Although  this  method  may  seem  well-grounded,  we  are 
not  convinced  that  the  bounds  hold  rigorously.  The  preci¬ 
sions  are  correct  if  the  samples  obtained  so  far  for  each  ac¬ 
tion  are  independent.  However,  this  might  not  be  the  case, 
since  the  number  of  samples  gathered  on  each  round  de¬ 
pends  on  a  property  of  the  previous  set  of  samples  (that  is, 
that  the  lower-bound  condition  did  not  hold).  It  is  not  yet 
clear  to  us  whether  the  fact  that  the  number  of  samples  de¬ 
pends  on  the  values  of  the  samples  implies  that  the  samples 
must  be  considered  dependent. 


Related  Work 

Charnes & Shenoy (1999) present a Monte Carlo method
similar  to  our  “traditional  method.”  One  difference  is  that 
they  use  a  heuristic  stopping  rule  based  on  a  normal  approx¬ 
imation  (i.e.,  the  estimates  have  an  asymptotically  normal 
distribution).  Their  method  takes  samples  until  all  the  esti¬ 
mates  achieve  a  required  standard  error  to  provide  the  cor¬ 
rect  confidence  interval  on  each  value  under  the  assumption 
that  the  estimates  are  normally  distributed  and  the  estimate 
of  the  variance  is  equal  to  the  true  variance.  They  do  not 
give  bounds  on  the  number  of  samples  needed  to  obtain  a 
near-optimal action with the required confidence. We refer
the reader to Charnes & Shenoy (1999) for a short description
and references on other similar Monte Carlo methods
for IDs.

Bielza,  Muller,  &  Insua  (1999)  present  a  method  based 
on  Markov-Chain  Monte  Carlo  (MCMC)  for  solving  IDs. 
Although  their  primary  motivation  is  to  handle  continuous 
action  spaces,  their  method  also  applies  to  discrete  action 
spaces.  Because  of  the  typical  complications  in  analyzing 
MCMC  methods,  they  do  not  provide  bounds  on  the  number 
of  samples  needed.  Instead,  they  use  a  heuristic  stopping 
rule  which  does  not  guarantee  the  selection  of  a  near-optimal 
action.  Other  MCMC-based  methods  have  been  proposed 
(See  Bielza,  Muller,  &  Insua  (1999)  for  more  information). 

Empirical  results 

We tried the different methods on a simple made-up ID. Given space restrictions we only describe it briefly (see Ortiz (2000) for details). Figure 4 gives a graphical representation of the ID for the computer mouse problem. The idea is to select an optimal strategy of whether to buy a new mouse ($A = 1$), upgrade the operating system ($A = 2$), or take no action ($A = 3$). The observation is whether the mouse pointer is working ($MP_t = 1$) or not ($MP_t = 0$). The variables of the problem are the status of the operating system ($OS$), the status of the driver ($D$), the status of the mouse hardware ($MH$), and the status of the mouse pointer ($MP$), all at the current and future time (subscripted by $t$ and $t + 1$). The variables are all binary.

The probabilistic model encodes the following information about the system. The mouse is old and somewhat unreliable. The operating system is reliable. It is very likely that the mouse pointer will not work if either the driver or the mouse hardware has failed. Table 1 shows the utility function $U(MP_{t+1}, A)$ and the values of the actions and observations $V_o(A)$ computed using an exact method. From Table 1 we conclude that the optimal strategy is: buy a new mouse ($A = 1$) if the mouse pointer is not working ($MP_t = 0$); take no action ($A = 3$) if the mouse pointer is working ($MP_t = 1$). This strategy has value 26.50.

Table 2 presents our results on the effectiveness of the sampling methods for this problem. We set our final desired accuracy for the output strategy to $\epsilon^* = 5$ and confidence level $\delta^* = 0.05$. This leads to the individual accuracy $2\epsilon = 2.5$ and confidence level $\delta = 0.025$ for each subproblem. We executed the sequential method and the comparison-based method 100 times. The comparison-based method produces major reductions in the number of samples. When we observe the mouse pointer not working, the comparison-based method always selects the optimal action of buying a new mouse. When we observe the mouse pointer working, the comparison-based method failed to select the optimal action of taking no action 4 times out of the 100. In those cases, it selected the next-to-optimal action of upgrading the operating system ($A = 2$). This action is within our accuracy requirements since the difference in value with respect to the optimal action is 0.91.

Figure 4: Graphical representation of the ID for the computer mouse problem.

         U(MP_{t+1}, A)              V_{MP_t}(A)
A     MP_{t+1}=0   MP_{t+1}=1     MP_t=0    MP_t=1
1          0           40          18.20      6.60
2          5           45           7.54      7.39
3         10           50          10.57      8.30

Table 1: This table presents the utility function and the (exact) value of actions and observations for the computer mouse problem.

The  comparison-based  method  is  highly  effective  in  cases 
where  there  is  a  clear  optimal  action  to  take.  For  instance, 
in  the  computer  mouse  problem,  buying  a  new  mouse  when 
we  observe  the  mouse  not  working  is  clearly  the  best  option. 
The differences in value between the optimal action and the
rest are not as large when we observe the mouse working.

In  this  problem,  the  results  for  the  sequential  method 
should  not  fully  discourage  us  from  its  use,  because  the  vari¬ 
ances  are  still  relatively  large.  We  have  seen  major  reduc¬ 
tions  in  problems  where  the  variance  is  significantly  smaller 
than  the  square  of  the  range  of  the  variable  whose  mean  we 
are  estimating. 

Summary  and  Conclusion 

The  methods  presented  in  this  paper  are  an  alternative  to 
exact  methods.  While  the  running  time  of  exact  methods 
depends  on  aspects  of  the  structural  decomposition  of  the 


                         Method
A       MP_t    Traditional    Sequential      Comp-based
1       0       2403           3802 (188)      335 (151)
2       0       3007           2266 (142)      115 (37)
3       0       3679           2426 (129)      118 (39)
1       1       2213           2508 (178)      521 (216)
2       1       2794           2969 (201)      695 (421)
3       1       3443           3468 (202)      1361 (560)
Total           17539          17438 (434)     3145 (809)

Table 2: Number of samples taken by the different methods for each action and observation. For the sequential and the comparison-based methods, the table displays the average number of samples over 100 runs. The values in parentheses are the sample standard deviations.


ID,  the  running  time  of  the  methods  presented  in  this  paper 
depends  primarily  on  the  range  of  the  weight  functions,  the 
variance  of  the  value  estimators  and  the  amount  of  separa¬ 
tion  between  the  value  of  the  best  action  and  that  of  the  rest 
(in  addition  to  the  natural  dependency  on  the  number  of  ac¬ 
tion  choices,  and  the  precision  and  confidence  parameters). 
In  some  cases,  we  can  know  in  advance  whether  they  will 
be  faster  or  not.  The  methods  presented  in  this  paper  can  be 
a  useful  alternative  in  those  cases  where  exact  methods  are 
intractable.  How  useful  depends  on  the  particular  character¬ 
istics  of  the  problem. 

Sampling  is  a  promising  tool  for  action  selection.  Our 
empirical  results  on  a  small  ID  suggest  that  sampling  meth¬ 
ods  for  action  selection  are  more  effective  when  they  take 
advantage  of  the  intuition  that  action  selection  is  primarily 
a  comparison  task.  We  look  forward  to  experimenting  with 
IDs  large  enough  that  sampling  methods  are  the  only  poten¬ 
tially  efficient  alternative.  Also,  our  work  leads  to  the  study 
of  adaptive  sampling  as  a  way  to  improve  the  effectiveness 
of  sampling  methods  (Ortiz  &  Kaelbling,  2000). 
Abstract

The focus of this work is the computation of efficient
strategies for commodity trading in a multi-market environment.
In today's "global economy" commodities
are often bought in one location and then sold (right
away, or after some storage period) in different markets.
Thus, a trading decision in one location must
be based on expectations about future price curves in
all other relevant markets, and on current and future
storage and transportation costs. Investors try to compute
a strategy that maximizes expected return, usually
with some limitations on assumed risk.

With  standard  stochastic  assumptions  on  commod¬ 
ity  price  fluctuations,  computing  an  optimal  strategy 
can  be  modeled  as  a  Markov  decision  process  (MDP). 
However,  in  general  such  a  formulation  does  not  lead 
to  efficient  algorithms.  In  this  work  we  propose  a 
model  for  representing  the  multi-market  trading  prob¬ 
lem  and  show  how  to  obtain  efficient  structured  algo¬ 
rithms for computing optimal strategies for a number of
commonly used trading objective functions (Expected
NPV, Mean-Variance, and Value at Risk).

Introduction 

Investment  is  the  act  of  incurring  immediate  cost  in 
the  expectation  of  future  reward.  Investment  options 
represent  various  tradeoffs  between  risk  and  expected 
profit.  Investors  try  to  maximize  their  expected  re¬ 
turn  subject  to  the  risk  level  that  they  are  willing  to 
assume.  Modern  economics  theory  models  the  uncer¬ 
tainty  of  future  rewards  as  a  stochastic  process  defining 
future  price  curves.  The  process  is  typically  Markovian, 
thus  investment  decision  can  be  modeled  as  a  Markov 
decision  process  (MDP)  (Bellman  1957;  Howard  1960; 
Puterman  1994)  where  a  state  of  the  underlying  process 
needs  only  to  include  the  current  investment  portfolio 
and  current  prices.  While  the  MDP  gives  a  succinct  for¬ 
malization  of  the  investment  decision  processes  it  does 
not  necessarily  imply  efficient  algorithms  for  computing 
optimal  strategies.  A  challenging  goal  in  this  research 
area  is  to  characterize  special  cases  of  the  general  in¬ 
vestment  paradigm  that  are  interesting  enough  from 
the  application  point  of  view  while  simple  enough  to 
allow  efficiently  computable  analytic  solutions. 


We  focus  in  this  paper  on  commodity  trading.  Past 
work  has  mainly  dealt  with  single  market  trading  prob¬ 
lems  (see  (Dixit  &  Pindyck  1994;  Hauskrecht,  Pan- 
durangan,  &  Upfal  1999)  and  the  references  there), 
where  commodity  is  bought,  stored  and  eventually  sold 
at  the  same  location.  Here  we  address  a  more  realistic 
scenario  in  today’s  “global  economy”,  that  of  a  multi¬ 
site  trading  problem  where  a  commodity  can  be  bought 
in  one  location,  stored  at  a  second  location  and  eventu¬ 
ally  sold  at  a  third  market.  Prices  at  different  locations 
may  be  different,  and  they  may  have  different  future 
price  curves.  Transportation  costs  also  vary  in  time. 
While  there  can  be  large  gaps  in  spot  prices  in  different 
locations,  future  prices  are  more  correlated  -  the  fu¬ 
ture  price  of  the  commodity  at  site  X  cannot  be  larger 
than  the  price  at  site  Y  plus  the  cost  of  transportation 
between  Y  and  X.  Trading  in  a  “global  economy”  is  sig¬ 
nificantly  more  complex,  since  a  local  trading  decision 
must  be  based  on  expectations  about  future  price  curves 
in  all  other  relevant  markets,  as  well  as  transportation 
and  storage  costs. 

Modeling  the  multi-site  commodity  trading  as  a 
Markov  decision  process  leads  to  a  large  state  space, 
and  a  large  action  space.  Nevertheless,  we  show  in  this 
work  that  under  several  commonly  used  trading  util¬ 
ity  functions  an  optimal  strategy  can  still  be  computed 
efficiently. 

A  standard  assumption  in  mathematical  economics 
is  that  commodity  prices  (e.g.,  oil  and  copper)  are  best 
modeled  as  a  mean  reverting  stochastic  process  (Dixit 
&  Pindyck  1994).  In  our  case,  prices  in  all  locations 
follow the mean reverting process but with a different set
of parameters for each site. To solve the trading
problem  we  first  consider  the  expected  net  present  value 
(ENPV)  objective  function,  where  the  goal  is  to  maxi¬ 
mize  expected  gain  with  no  consideration  to  risk.  Un¬ 
der  this  objective  function  the  optimization  problem 
becomes  myopic  and  can  be  computed  by  considering 
only  current  and  next  step  prices.  This  allows  us  to  de¬ 
sign  global  optimal  portfolio  allocation  algorithms  that 
are  polynomial  in  the  number  of  sites  in  each  trading 
step. 

Building  on  the  myopic  property  of  the  ENPV  objec¬ 
tive  function  we  extend  the  result  to  two  commonly 


used  objective  functions  that  combine  ENPV  maxi¬ 
mization  with  limits  on  assumed  risk  at  any  one  step. 
In the Mean-Variance function the goal is to maximize
a  weighted  difference  of  the  expected  gain  and  the  vari¬ 
ance.  The  Value  at  Risk  function  maximizes  expected 
gain  subject  to  a  (probabilistic)  limit  on  the  possibility 
of  a  large  loss  at  any  one  step.  Since  both  functions 
include  a  term  that  is  linear  in  the  variance  of  the  pro¬ 
cess,  the  optimization  problems  in  both  cases  lead  to 
a  constrained  quadratic  optimization  problem.  How¬ 
ever,  the  computational  complexities  of  the  two  prob¬ 
lems are different. The mean-variance function has a
particular structure that allows for a polynomial time solution.
The complexity of the optimization problem
for the value at risk function varies; some special cases
have polynomial time solutions. To improve the computational
efficiency of both methods even further we
present structure-based algorithms exploiting the special
structure and regularities of the problem.

The  Model 

We consider investment problems with one type of commodity
that is traded at n different sites. Once the commodity
is bought it can be either stored in each of the
locations or transported between any two locations.

Price  model 

We assume that trading occurs at discrete time steps. To model commodity price fluctuations we adopt a discrete time version of the mean-reverting model (Dixit & Pindyck 1994):

$$p^{(t+1)} = p^{(t)} + (1 - e^{-\eta})(\mu - p^{(t)}) + \epsilon^{(t)} \qquad (1)$$

where $\mu$ is the long term average price of the commodity, i.e., a value to which the process reverts, $\eta$ is the speed of reversion and $\epsilon^{(t)}$ is a sequence of independent random variables following normal distribution $N(0, \sigma_\epsilon)$.^1

Commodity prices at all locations follow mean reverting processes, each with different parameters and with possible correlations between their random components $\epsilon$. Their combined fluctuations are fully described by a multivariate normal distribution $N(0, \Sigma)$, with a zero mean vector and a covariance matrix $\Sigma$. We assume that price movements are independent of our trading activities. Also, there is no fee for trading and buy and sell prices are the same.^2
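A single-site price path under this model can be simulated in a few lines of Python. The sketch assumes the update $p' = p + (1 - e^{-\eta})(\mu - p) + \epsilon$ and treats `sigma` as the standard deviation of the noise term; both the function names and that reading of the noise parameter are assumptions of this example. Correlated noise across sites is left out for brevity.

```python
import math
import random

def step_price(price, mu, eta, sigma, rng):
    """One step of the discrete-time mean-reverting process."""
    reversion = (1 - math.exp(-eta)) * (mu - price)   # pull toward the long-term mean
    return price + reversion + rng.gauss(0.0, sigma)  # plus Gaussian noise

def simulate(p0, mu, eta, sigma, steps, rng):
    """Return the price path [p0, p1, ..., p_steps]."""
    path = [p0]
    for _ in range(steps):
        path.append(step_price(path[-1], mu, eta, sigma, rng))
    return path
```

With `sigma=0` the path converges geometrically to `mu`, since the gap to the mean shrinks by a factor of $e^{-\eta}$ at every step. With positive `sigma` the path can become negative, which is the issue the first footnote raises.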

There  are  natural  capacity  constraints  on  the  number 
of  commodity  units  we  can  transport  (store)  between 

^1 We note that normally distributed random components
of the price process may lead to negative prices. One way to
deal with this issue is to use a geometric version of the mean
reverting process, where the logarithm of the price follows
the mean reverting model. However, the behavior of such
a model is quite different, and price curves of the standard
model are more realistic.

^2 In the more general setting (not considered here) prices
can also fluctuate based on our demand and supply for the
commodity or transportation service.


the  two  locations  at  any  time  step.  However,  there  are 
no  constraints  on  buy  and  sell  activities. 

Valuation 

Profit is measured by the standard expected net present value (ENPV) (see e.g. (Brealey & Myers 1991; Trigeorgis 1996)):

$$V^\pi(s) = E\Big(\sum_{t=0}^{T} \gamma^t m^{(t)} \,\Big|\, \pi, s\Big) \qquad (2)$$

where $s$ denotes an initial state, $\pi$ is the trading strategy, $\gamma$ is a discount factor, with $r$ denoting the interest rate (present cost of money), $T$ is the decision horizon, and $m^{(t)}$ is the cash flow at time $t$. We focus primarily on problems with infinite horizon ($T \to \infty$).

Markov  decision  process  formulation  of 
the  problem 

A Markov decision process (MDP) (Bellman 1957; Howard 1960; Puterman 1994) describes a stochastic controlled process represented by a 4-tuple $(S, A, T, R)$, where $S$ is a set of process states; $A$ is a set of actions; $T : S \times A \times S \to [0, 1]$ is a probabilistic transition model describing the dynamics of the modeled system; and $R : S \times A \times S \to \mathbb{R}$ models rewards assigned to transitions.

In the multi-site commodity trading problem the state of a process is determined by a price vector

$$p = \{p_1, p_2, \ldots, p_n, p_{11}, p_{12}, \ldots, p_{nn}\},$$

where the $p_i$'s give the commodity price at location $i$, the $p_{ij}$'s give the transportation price from $i$ to $j$, and the $p_{ii}$'s give the storage price at site $i$. Actions represent trading activities at a specific time step, and are defined as

$$a = \{a_{11}, a_{12}, \ldots, a_{nn}\},$$

where $a_{ij}$ is the amount of commodity to be transported between $i$ and $j$, or stored at location $i$ if $j = i$. Thus, actions define allocations of commodity to different transportation (storage) edges.^3

The transition model is defined by a set of mean-reverting price functions (Equation 1), one for each location. For example, the price movement for location $i$ is

$$p_i' = p_i + (1 - e^{-\eta_i})(\mu_i - p_i) + \epsilon_i,$$

where $p_i$ and $p_i'$ are the current and next step prices, $\eta_i$ and $\mu_i$ are the parameters of the mean-reverting process and $\epsilon_i$ is the random component.

^3 It is easy to see that the number of units to be transported
between different locations is sufficient to define all
trading activities. Simply, the number of units to buy and
sell at different locations can be obtained by comparing the
number of units currently held and the number of units to
be transported from that location in the next step.


Rewards  represent  partial  profits  from  applying  the 
strategy  and  are  modeled  in  terms  of  step-wise  gains. 
The  gain  for  transporting  one  unit  of  commodity  be¬ 
tween  location  i  and  j  is  defined  by 

$$g_{ij}(p) = -p_i - p_{ij} + \gamma p_j',$$

where $p_i$ is the current price of the commodity in location $i$, $p_{ij}$ is the cost of transportation and $p_j'$ is the price of the commodity in location $j$ in the next step. The gain for an action $a$ that allocates commodity to different transportation edges is the combination of partial gains

$$g_a(p) = g(p) \cdot a = \sum_{i=1}^{n} \sum_{j=1}^{n} g_{ij}(p)\, a_{ij}.$$
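As a small worked example of these definitions, the Python sketch below computes the per-edge gains $-p_i - p_{ij} + \gamma p_j'$ for a given realization of next-step prices and then the total gain of an allocation. The function names and the numbers in the usage note are hypothetical.

```python
def edge_gains(prices, transport, next_prices, gamma):
    """Matrix of one-step gains g[i][j] = -p_i - p_ij + gamma * p'_j.

    transport[i][i] is read as the storage price at site i.
    """
    n = len(prices)
    return [[-prices[i] - transport[i][j] + gamma * next_prices[j]
             for j in range(n)]
            for i in range(n)]

def action_gain(gains, allocation):
    """Total gain of an allocation: sum over edges of g[i][j] * a[i][j]."""
    return sum(g * a
               for g_row, a_row in zip(gains, allocation)
               for g, a in zip(g_row, a_row))
```

With prices `[10, 12]`, transport costs `[[1, 2], [2, 1]]`, next-step prices `[11, 15]` and `gamma = 0.9`, moving one unit from site 0 to site 1 gains `-10 - 2 + 13.5 = 1.5`, and the allocation `[[0, 3], [0, 2]]` has total gain `3 * 1.5 + 2 * 0.5 = 5.5`.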

Using  our  model,  a  sequence  of  cash  flows  for  any 
strategy  can  be  expressed  in  terms  of  step-wise  gains 
(rewards)  rather  than  actual  money  inflow  and  outflow. 
Intuitively,  we  can  replicate  payoffs  from  any  strategy 
by  buying  the  commodity  at  the  beginning  of  a  decision 
step  and  selling  it  at  the  end  of  that  step.  Therefore, 
the expected NPV model from Equation 2 for a strategy
$\pi$ can be expressed in terms of gains as

$$V^\pi(p) = \lim_{T \to \infty} E\Big(\sum_{t=0}^{T} \gamma^t g^{(t)} \,\Big|\, \pi, p\Big), \qquad (3)$$

where $g^{(t)}$ is the gain at time $t$. This is exactly the
discounted,  infinite-horizon  criterion  used  commonly 
in  MDPs  (Puterman  1994).  Thus,  our  multi-site  in¬ 
vestment  problem  for  expected  NPV  model  can  be  ex¬ 
pressed  and  solved  as  a  Markov  decision  problem. 

The optimal trading strategy for the discounted, infinite horizon Markov decision problem is stationary (see (Bellman 1957; Puterman 1994)) and maps states of the process to actions. Therefore, the optimal strategy for our problem is $\pi^* : R^n \times R^{n^2} \to R^{n^2}$, mapping the current commodity and transportation prices to amounts of units to be allocated to different transportation/storage edges.

Solving  the  expected  NPV  problem 

Using the MDP formulation, Equation 3 for the expected NPV model and a fixed policy $\pi$ can be rewritten in Bellman's form (Bellman 1957) as

$$V^\pi(p) = E(g_{\pi(p)}(p)) + \gamma \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} V^\pi(p')\, f(p' \mid p)\, dp', \qquad (4)$$

where $E(g_{\pi(p)}(p))$ is the expected one-step gain for $\pi(p)$ and $f(p' \mid p)$ is the conditional probability density function of the next step prices.

Myopic  property 

We see that $V^\pi(p)$ is hard to compute exactly. However,
despite this difficulty the optimal strategy that
maximizes  ENPV  can  be  computed  efficiently.  A  key 


feature  of  our  model  is  that  prices  change  indepen¬ 
dently  of  our  trading  decisions  (see  Equation  4).  Thus, 
the  optimal  policy  is  myopic  (a  greedy  one-step  policy 
is  globally  optimal)  and  can  be  easily  computed  (see 
(Hauskrecht,  Pandurangan,  &  Upfal  1999)). 

Theorem 1 The optimal trading strategy for the expected NPV model is myopic.

Proof The value of the optimal trading strategy is obtained from Equation 4 by maximizing over all possible actions:

$$V^{*}(p) = \max_{a} \left[ E(g_{a}(p)) + \gamma \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} V^{*}(p') f(p'|p)\, dp' \right].$$

As the next-step prices are independent of the action choice, the value can be rewritten as

$$V^{*}(p) = \max_{a} \left[ E(g_{a}(p)) \right] + \gamma \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} V^{*}(p') f(p'|p)\, dp'.$$

We see that in order to get the optimal solution for a it is sufficient to optimize a only with regard to $E(g_{a}(p))$. Thus the optimal strategy is myopic. □

The myopic property of the optimal investment strategy is critical for computing the solution for the commodity problem. The complete optimal investment strategy $\pi^{*} : R^{n} \times R^{n^2} \to R^{n^2}$ allocates the commodity units to different transportation edges for every price vector p. As the number of possible prices and corresponding allocations is very large, it is not feasible to represent and store the optimal policy.

One way to avoid computing the complete policy is to compute individual price-specific allocations on-line. The on-line algorithm is invoked repeatedly in every step. In the general case, the on-line phase may be very time consuming, as it may require examining multiple price trajectories spanning multiple time steps. The myopic property of the decision process (Theorem 1) assures that we can obtain the optimal solution just by looking at what can happen in the next step. Simply put, to decide the best allocation of investment for some price vector p it is sufficient to choose the allocation with the best one-step expected gain; it is not necessary to consider the more distant future and possible later price movements.


Optimal  allocation 

To  find  the  optimal  trading  strategy  for  the  expected 
NPV  model  it  is  sufficient  to  optimize  expected  one- 
step  gains.  Let  a  be  some  allocation  of  units  to  different 
transportation  edges.  The  expected  gain  for  a  is 

$$E(g_{a}(p)) = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, E(g_{ij}(p)).$$

To maximize the expectation we need to maximize the components of the sum. Assuming that $C_{ij}$ is the constraint on the number of units we can transport between locations i and j, the optimal allocation of $a_{ij}$ is easy:


$$a_{ij} = \begin{cases} C_{ij} & \text{if } E(g_{ij}(p)) > 0; \\ 0 & \text{otherwise.} \end{cases}$$


Simply put, we invest up to the limit on every edge with a positive expected gain.
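As a minimal sketch of this rule (not the authors' implementation), the following Python function saturates every edge whose expected one-step gain is positive. The function name `enpv_allocation`, the nested-list layout, and the numbers in the example are assumptions made purely for illustration.

```python
def enpv_allocation(expected_gain, capacity):
    """Return a[i][j] = C[i][j] if E(g_ij) > 0, else 0.

    expected_gain[i][j] and capacity[i][j] are hypothetical n x n nested
    lists holding E(g_ij(p)) and C_ij for the edge from site i to site j.
    """
    n = len(expected_gain)
    return [
        [capacity[i][j] if expected_gain[i][j] > 0 else 0 for j in range(n)]
        for i in range(n)
    ]


# Illustrative 2-site example (made-up numbers).
expected_gain = [[-0.5, 1.2], [0.3, -0.1]]
capacity = [[10, 5], [8, 10]]
allocation = enpv_allocation(expected_gain, capacity)
```

Because each edge is decided independently, the cost of this step grows only with the number of edges.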


Objective  functions  with  one-step  risk 
models 

Once  risk  is  taken  into  account,  the  above  strategy  of 
investing  the  limit  on  all  edges  with  positive  expected 
gains  may  not  be  optimal  anymore. 

Investment  risk  can  be  incorporated  into  the  model  in 
various  ways.  We  focus  here  on  objective  functions  that 
penalize  or  bound  risk  in  any  single  step.  In  particular, 
we  investigate: 

• Mean-Variance model (Markowitz 1991; Alexander & Francis 1986; Bodie, Kane, & Marcus 1992) that explicitly relates expected one-step gain and the gain variance;

• Value at Risk (VaR) model (Jorion 1996) which maximizes the expected present value of the investment, but at the same time limits possible step losses.

The important property of both models is that their value function is time-decomposable and can be expressed in a form similar to the expected NPV model:

$$V^{*}(p) = \max_{a} \left[ h(g_{a}(p)) + \gamma \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} V^{*}(p') f(p'|p)\, dp' \right] = \max_{a} \left[ h(g_{a}(p)) \right] + \gamma \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} V^{*}(p') f(p'|p)\, dp'. \qquad (5)$$

Here, $h(g_{a}(p))$ is a function of the one-step gain (a random variable), not just its expectation. Different risk models use different forms of h. Note that the optimal policies must be myopic for this formalization.


Mean-Variance (MV) model

The mean-variance model (Markowitz 1991; Alexander & Francis 1986; Bodie, Kane, & Marcus 1992) quantifies the risk in terms of the gain volatility. The model is additive and combines the expected one-step gain and the gain volatility into a single objective function $h_{a}(p)$:

$$h_{a}(p) = \alpha E(g_{a}(p)) - \beta\, Var(g_{a}(p)), \qquad (6)$$

where $\alpha, \beta > 0$. Intuitively, the function reflects the fact that investors like the mean to be large but dislike the variance. Parameters $\alpha$ and $\beta$ quantify this relation. We note that this valuation corresponds to the quadratic utility function (Markowitz 1991).

Using the valuation function from Equation 6, our goal is to find the allocation of commodity that maximizes it. That is:

$$\pi^{*}(p) = \arg\max_{a} \left[ \alpha E(g_{a}(p)) - \beta\, Var(g_{a}(p)) \right], \qquad (7)$$

subject to constraints $C_{ij} \ge a_{ij} \ge 0$ for all $a_{ij}$. The variance of the gain for a is:

$$Var(g_{a}(p)) = a^{T} \Sigma' a,$$


Figure  1:  An  example  of  a  concave  quadratic  function 
for  two  dimensions. 


where $\Sigma'$ is the gain covariance matrix obtained from the price covariance matrix $\Sigma$ as:

$$\Sigma'_{(ij),(kl)} = Cov(g_{ij}(p), g_{kl}(p)) = \gamma^{2}\, Cov(\epsilon_{j}, \epsilon_{l}) = \gamma^{2} \Sigma_{jl}.$$
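To make the mean-variance objective concrete, the sketch below builds the gain covariance matrix from a price covariance matrix and evaluates the objective of Equation 6 for a given allocation. It assumes that edges are flattened in row-major order, so that edge (i, j) has index i*n + j; the function names and all numbers are illustrative and are not taken from the paper.

```python
def gain_covariance(price_cov, gamma):
    """Gain covariance: entry for edges (i, j), (k, l) is gamma^2 * price_cov[j][l].

    Edges are flattened row-major (an assumption of this sketch), so the
    result is an n^2 x n^2 nested list.
    """
    n = len(price_cov)
    targets = [j for _ in range(n) for j in range(n)]  # target site of each edge
    return [[gamma ** 2 * price_cov[j][l] for l in targets] for j in targets]


def mv_objective(a, mean_gain, cov, alpha, beta):
    """alpha * E(g_a) - beta * Var(g_a) for a flattened allocation vector a."""
    m = len(a)
    expected = sum(a[r] * mean_gain[r] for r in range(m))
    variance = sum(a[r] * cov[r][c] * a[c] for r in range(m) for c in range(m))
    return alpha * expected - beta * variance


# Illustrative 2-site example (made-up numbers).
price_cov = [[4.0, 1.0], [1.0, 9.0]]
cov = gain_covariance(price_cov, gamma=0.5)
value = mv_objective([1, 0, 0, 2], [0.5, 1.0, -0.2, 0.3], cov, alpha=1.0, beta=0.01)
```

In this example edges (0, 0) and (1, 0) share target 0, so their covariance equals the variance of either one: gains of edges leading to the same location are fully correlated.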

The allocation weights in a must be non-negative since negative investment has no meaning in our model. Also, the weights $a_{ij}$ should take only integer values. However, to simplify the problem and its solution we approximate the integer problem by allowing continuous allocation weights.

Solution for the model. Equation 7 defines a quadratic optimization problem with linear constraints. The important property of this problem is that the h function has a unique global optimum. We can observe this from the fact that the Hessian of our function is a constant negative definite matrix (equal to $-2\beta\Sigma'$). Therefore, the function is concave. Figure 1 illustrates the shape of the function for the 2-dimensional case. This special case of Quadratic Programming is known to have a polynomial time solution (Vavasis 1991).

Exploiting the structure. Solving the optimization problem requires optimizing all possible allocation weights. We show that this optimization can be carried out more efficiently by taking advantage of the problem structure and by solving a sequence of optimizations of smaller complexity.

The idea of our solution is to exploit the regularities of the covariance matrix $\Sigma'$ of one-step gains for all transportation edges, in particular the fact that random components of transportation links leading to the same location are fully correlated. Combining this property with the MV criterion makes it possible to find the optimal allocation incrementally. The approach is based on the following theorem.

Footnote (on non-negative weights): We note that in some problems in finance similar to ours (e.g. portfolio optimization), constraints on weights can be lifted. This is the case when short-selling of an asset or security is possible. In that case, negative weights in the portfolio reflect a short position.

Footnote (on the Hessian): Recall that the covariance matrix $\Sigma'$ is symmetric, positive definite.

Theorem 2 Let $a^{*}$ be the optimal allocation of commodity maximizing expected gains (returns) and penalizing risk (volatility). Let (i, j) and (k, j) be two different transportation links ending in the same target location j such that $-p_{k} - p_{kj} < -p_{i} - p_{ij}$ holds. Then $a^{*}_{kj} > 0$ only if $a^{*}_{ij} = C_{ij}$; otherwise $a^{*}_{kj} = 0$.

Proof Gains from transporting one unit of commodity from i to j and from k to j are

$$g_{ij}(p) = -p_{i} - p_{ij} + \gamma \left[ p_{j} + (1 - e^{-\eta})(\bar{p}_{j} - p_{j}) + \epsilon_{j} \right]$$

$$g_{kj}(p) = -p_{k} - p_{kj} + \gamma \left[ p_{j} + (1 - e^{-\eta})(\bar{p}_{j} - p_{j}) + \epsilon_{j} \right]$$

The two gains share the same stochastic component, so their difference is always deterministic:

$$g_{ij}(p) - g_{kj}(p) = -p_{i} - p_{ij} - \left[ -p_{k} - p_{kj} \right].$$

Moreover, their covariance terms in $\Sigma'$ are the same. Thus, if $-p_{k} - p_{kj} < -p_{i} - p_{ij}$, there is no value in allocating the commodity to the transportation link from k before we allocate the maximum, $C_{ij}$, to $a_{ij}$. Therefore if $a^{*}_{kj} > 0$, $a^{*}_{ij}$ must be saturated ($a^{*}_{ij} = C_{ij}$). By a similar argument, $a^{*}_{ij} < C_{ij}$ implies $a^{*}_{kj} = 0$. □

Using this result we can allocate commodity to different transportation edges incrementally, in order of their expected gains; i.e., edges with higher expected gains for the same target location are allocated first. This approach translates to a sequence of quadratic optimization problems with at most n variables.

The algorithm works as follows: the optimization starts by considering only the transportation choices with the highest expected gains, one for each target location. We refer to these edges as active edges. The optimization procedure for the MV model is then applied to the active edges. The solution gives an allocation of units to all active edges. During the optimization a transportation edge can reach its maximum capacity; we say that the edge becomes saturated. Once an edge is saturated it is removed and no longer considered as a choice. After the removal, the transportation edge with the next highest expected gain (and the same target location) becomes active and the optimization process continues with the next step. This is repeated until all edges have been exhausted or none of the edges was saturated in the last step.

The optimization steps are not independent. In particular, every optimization step must take into consideration the results of all previous (partial) allocations. The dependencies between the current and previous steps are summarized by:

• a vector of target allocations $s = \{s_{1}, s_{2}, \ldots, s_{n}\}$ reflecting, for each target location, the number of units of commodity already allocated to edges incident to that location;

• adjusted capacity constraints $\{D_{11}, \ldots, D_{nn}\}$ representing the remaining capacity of all edges, i.e., the original capacity less the capacity already allocated in all previous solutions.

To find the optimal allocation of commodity to the active set of edges we solve a quadratic program (with n variables). Let $a = \{a_{1}, a_{2}, \ldots, a_{n}\}$ denote a vector of allocations for the current set of active edges and $E(g_{j}(p))$ be the expected gain for the active edge for target j. Then the optimization task corresponds to:

$$\max_{a} \left\{ \alpha E(g_{a}(p)) - \beta \left[ (s + a)^{T} \tilde{\Sigma} (s + a) \right] \right\} \qquad (8)$$

subject to constraints:

$$D_{j} \ge a_{j} \ge 0 \text{ for all } a_{j},$$

where $E(g_{a}(p)) = \sum_{j=1}^{n} a_{j} E(g_{j}(p))$ is the expected gain for the portfolio of active edges and $\tilde{\Sigma}$ is the reduced gain covariance matrix, an n x n matrix of the gain fluctuations for target locations ($\tilde{\Sigma}_{kl} = \gamma^{2} \Sigma_{kl}$). $D_{j}$ denotes the adjusted capacity constraint corresponding to the transportation link for a target location j which is subject to optimization (is active).

During the computation process we keep track of the number of units allocated to each transportation link (starting from zero allocations at the beginning). That is, after every optimization step we apply the following update:

$$a^{*}_{ij} \leftarrow \begin{cases} a^{*}_{ij} + a_{j} & \text{if link } (i, j) \text{ is active;} \\ a^{*}_{ij} & \text{if not active.} \end{cases}$$

This allows us to recover the optimal allocation $a^{*}$ at the end. In addition, we update the s quantities and adjust the dynamic capacity constraints:

$$D_{ij} \leftarrow \begin{cases} D_{ij} - a_{j} & \text{if link } (i, j) \text{ is active;} \\ D_{ij} & \text{if not active.} \end{cases}$$
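The incremental procedure and its bookkeeping can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the reduced quadratic program is solved here by a simple cyclic coordinate ascent with clipping rather than by a dedicated QP solver, and the function names, the per-target list layout, and the example numbers are all hypothetical.

```python
def solve_box_qp(mu, cov, s, cap, alpha, beta, sweeps=200):
    """Maximize alpha*mu.a - beta*(s+a)^T cov (s+a) subject to 0 <= a_j <= cap_j.

    Uses cyclic coordinate ascent with clipping, a simple stand-in for a
    proper QP solver. Assumes cov is symmetric with a positive diagonal.
    """
    n = len(mu)
    a = [0.0] * n
    for _ in range(sweeps):
        for j in range(n):
            if cap[j] <= 0:
                continue  # no active edge (or no capacity left) for target j
            off = sum(cov[j][l] * (s[l] + a[l]) for l in range(n) if l != j)
            best = (alpha * mu[j] / (2 * beta) - off) / cov[j][j] - s[j]
            a[j] = min(max(best, 0.0), cap[j])
    return a


def incremental_mv(gains, caps, cov, alpha, beta, tol=1e-9):
    """gains[j][e], caps[j][e]: expected gain and capacity of the e-th edge into target j."""
    n = len(gains)
    # Per target, consider edges in order of decreasing expected gain.
    order = [sorted(range(len(gains[j])), key=lambda e: -gains[j][e]) for j in range(n)]
    pos = [0] * n                                      # position of the active edge
    alloc = [[0.0] * len(gains[j]) for j in range(n)]  # accumulated allocation a*
    s = [0.0] * n                                      # units already sent to each target
    remaining = [list(c) for c in caps]                # adjusted capacities D
    while True:
        active = [order[j][pos[j]] if pos[j] < len(order[j]) else None for j in range(n)]
        if all(e is None for e in active):
            break  # every edge has been exhausted
        mu = [0.0 if e is None else gains[j][e] for j, e in enumerate(active)]
        cap = [0.0 if e is None else remaining[j][e] for j, e in enumerate(active)]
        a = solve_box_qp(mu, cov, s, cap, alpha, beta)
        any_saturated = False
        for j, e in enumerate(active):
            if e is None:
                continue
            alloc[j][e] += a[j]
            s[j] += a[j]
            remaining[j][e] -= a[j]
            if remaining[j][e] <= tol:  # saturated: activate the next edge
                pos[j] += 1
                any_saturated = True
        if not any_saturated:
            break  # no edge was saturated in the last step
    return alloc


# Illustrative example: two targets with uncorrelated, unit-variance gains.
result = incremental_mv(
    gains=[[1.0, 2.0], [0.5]], caps=[[5.0, 1.0], [10.0]],
    cov=[[1.0, 0.0], [0.0, 1.0]], alpha=1.0, beta=0.5,
)
```

In the example, the better edge into target 0 is saturated at its capacity of 1 unit, after which the second edge into that target receives nothing because the variance penalty already offsets its smaller expected gain.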

Example. Figures 2, 3 and Table 1 illustrate and compare the performance of strategies for different optimality criteria (the Value at Risk criterion is discussed in the next section) on a problem with 5 trading sites. Figure 2 shows the actual step-wise gains obtained for these criteria using a fixed 50-step trajectory of price fluctuations, each price following a mean-reverting process. Table 1 summarizes the results in Figure 2 by showing real gain averages and their standard deviations. Finally, Figure 3 compares expectations of gains under different strategies. We see that ENPV always leads to the maximum expected gain and it also achieves higher real gains on average. However, step-wise gains for ENPV are also subject to higher fluctuations. On the other hand, the Mean-Variance (MV) criterion yields gains that fluctuate less, but at the same time leads to considerably lower expected gains and also lower real gains on average.

Besides the experiments shown here, we have tested the performance of the MV model for different combinations of parameters $\alpha$ and $\beta$. As expected, higher values of $\beta$ lead to smaller average gains and smaller gain fluctuations. Simply put, for higher values of $\beta$ we penalize the variance more and thus are more likely to sacrifice the opportunity to capture higher gains.

[Figure 2 plot: Real gains for ENPV, MV and VaR]

Figure 2: Comparison of three different optimization criteria: Expected NPV (ENPV), Mean-Variance (MV) and Value at Risk (VaR) on a problem with 5 trading sites and a 50-step trajectory of price fluctuations, each price following a mean-reverting process. For each step we plot the real gains for that step. The parameters of the MV model we use are $\alpha = 1$ and $\beta = 0.01$. We use $K = 0$ and $\delta = 0.0005$ for the VaR model.

[Figure 3 plot: Expected gains for ENPV, MV and VaR]

Figure 3: Comparison of expected gains for three different optimization criteria: Expected NPV (ENPV), Mean-Variance (MV) and Value at Risk (VaR) on a problem with 5 trading sites and 50-step price trajectories.

                      ENPV     MV       VaR
average real gains    57.31    13.69    42.47
standard deviation    70.64    17.49    60.54

Table 1: Average of the real gains and their standard deviation for ENPV, MV and VaR criteria and data from Figure 2.

[Figure 4 plot: MV time]

Figure 4: Average running times for markets with a varying number of trading sites.

One concern in applying our approach is that the optimization is carried out on-line in every step, and thus it may lead to large reaction delays for larger problems (with many trading sites). To see the effect of the size of the multi-site market on the actual running time of the optimization problem we ran a set of experiments, varying the number of trading sites. For each market size we ran 1000 different parameter settings and averaged the running times. To solve the quadratic optimization problem we use the IMSL C/Math/Library implementation based on (Goldfarb & Idnani 1983). Figure 4 shows average running times obtained for different market sizes. The running time (in seconds) increases moderately with the number of sites. In particular, the solution for 30 different trading sites, which is about the practical limit, can be obtained very quickly (in about 3 seconds on a SUN Ultra-10).

Value  at  Risk  (VaR)  model 

Let K be a loss threshold and $\delta$ the maximum probability of losing K or more units. The value of K is called the value at risk for $\delta$ (see (Jorion 1996)).

This optimization problem has the form of Equation 5, where we maximize

$$h(g_{a}(p)) = E(g_{a}(p))$$

subject to

$$C_{ij} \ge a_{ij} \ge 0 \text{ for all } a_{ij},$$

$$P(g_{a}(p) \le -K) \le \delta. \qquad (9)$$

This is a linear optimization problem with linear and quadratic constraints. Inequality 9 reduces to a quadratic constraint by the properties of the normal distribution. Let x be a normally distributed random variable with mean $\mu$ and variance $\sigma^{2}$. Let k be a value such that $P(x \le \mu - k\sigma) \le \delta$ holds. The value of k measures the distance from the mean in units of the standard deviation $\sigma$, such that values smaller than $\mu - k\sigma$ occur with probability less than $\delta$. In the case of a normal distribution, k is only a function of $\delta$, and it is independent of $\mu$ and $\sigma$. Therefore, in order to limit the losses of more than K units with probability $1 - \delta$, we take the value $k_{\delta}$ determined by $\delta$ and require that $\mu - k_{\delta}\sigma \ge -K$.

Therefore, the constraint 9 can be rewritten as

$$\left[ E(g_{a}(p)) + K \right]^{2} - k_{\delta}^{2}\, Var(g_{a}(p)) \ge 0 \qquad (10)$$

which is quadratic in allocation weights a. We can rewrite the constraint in terms of mean one-unit gains (vector $\mu$) and covariances ($\Sigma'$) as:

$$a^{T} \left[ \mu\mu^{T} - k_{\delta}^{2} \Sigma' \right] a + 2K\mu^{T} a + K^{2} \ge 0. \qquad (11)$$

Let $W = [\mu\mu^{T} - k_{\delta}^{2}\Sigma']$ be the $n^{2} \times n^{2}$ matrix defining the quadratic term. We note that if the matrix W is negative definite, the problem corresponds to linear optimization over a convex space. Thus, it can be solved efficiently in polynomial time (Papadimitriou & Steiglitz 1998). However, when the matrix W is not negative definite, the space over which we optimize is non-convex. To solve this problem we can apply standard augmented Lagrangian techniques (see e.g. (Bertsekas 1995)).
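The reduction of the probabilistic constraint 9 to a deterministic inequality can be illustrated with a short Python sketch. It assumes the gain of an allocation is normally distributed and checks the unsquared form, mean minus $k_{\delta}$ standard deviations at least $-K$; this agrees with the squared form in constraint 10 whenever $E(g_{a}(p)) + K \ge 0$ and $k_{\delta} > 0$. The function names and numbers are hypothetical; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist


def k_delta(delta):
    """Number of standard deviations k with P(x <= mean - k*std) = delta for a normal x."""
    return -NormalDist().inv_cdf(delta)


def var_constraint_holds(a, mean_gain, cov, K, delta):
    """Check P(g_a <= -K) <= delta, assuming the gain g_a is normally distributed."""
    m = len(a)
    expected = sum(a[r] * mean_gain[r] for r in range(m))
    variance = sum(a[r] * cov[r][c] * a[c] for r in range(m) for c in range(m))
    return expected - k_delta(delta) * variance ** 0.5 >= -K


# Illustrative single-edge example (made-up numbers), with K = 0 and delta = 0.0005.
k = k_delta(0.0005)
risky = var_constraint_holds([1.0], [1.0], [[1.0]], K=0.0, delta=0.0005)
safe = var_constraint_holds([1.0], [1.0], [[0.01]], K=0.0, delta=0.0005)
```

With $\delta = 0.0005$ the multiplier is roughly 3.29 standard deviations, so the same expected gain passes the check when its standard deviation is 0.1 but fails when it is 1.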

Using structure to solve the VaR model. The optimization of the VaR criterion can be performed more efficiently by solving a sequence of optimization problems of smaller complexity. This is the same idea as used for the structured solution of the Mean-Variance model, and Theorem 2 also applies to this case. Simply put, the only sources of stochasticity are price fluctuations at different target locations. Thus, if two different transportation edges share the same target location, their stochastic component is the same, and for a rational and risk-averse investor the transportation choice with the better expected gain should be chosen first. Therefore, under transportation capacity constraints, the global optimization can be carried out incrementally by solving a sequence of optimization problems with n variables, instead of one optimization with $n^{2}$ variables. The globally optimal solution is then constructed from the results of the partial solutions.

To solve the problem, we repeatedly optimize the (reduced) problem with n variables:

$$\max_{a} E(g_{a}(p))$$

subject to

$$D_{j} \ge a_{j} \ge 0, \text{ for all } a_{j};$$

$$\left[ S_{M} + E(g_{a}(p)) + K \right]^{2} - k_{\delta}^{2}\, (s + a)^{T} \tilde{\Sigma} (s + a) \ge 0.$$


The notation used and the basic algorithm applied are the same as in the Mean-Variance case. The only difference is that for the VaR criterion we have to add a constant $S_{M}$, which represents the sum of expected gains for all previous solutions. This quantity is updated dynamically after every step and is needed to ensure that the non-linear constraint is not violated during the optimization process.


Example. Figures 2, 3 and Table 1 compare the VaR criterion to the ENPV and MV criteria on a problem with 5 sites. We note that the VaR criterion does not penalize a large variance when the expectation is also high. Instead, it only tries to limit the probability of losses. Thus the real gains obtained for the VaR model vary more than those of the MV model and also tend to be higher (both in expectation and on average). From the graphs we observe that in many instances the allocations for the VaR criterion replicate exactly the ENPV choices. However, in some instances, when the chance of losses exceeds the confidence threshold, the approach is more conservative and the allocation it chooses is different. For example, in the 50 simulation steps in Figure 2 the VaR approach (with threshold gain 0) never led to a negative gain, while there are seven cases of negative gains for ENPV and two for the MV criterion.


Conclusion 

We addressed the complex problem of finding optimal strategies for trading commodity in a multi-market environment. We investigated various objective criteria based on expected net present value (ENPV) and risk preferences of the investor. Different criteria can lead to optimization problems of different complexity. We showed that under the assumption of equal buy and sell prices, a number of criteria lead to the myopic portfolio optimization problem. This is very important as the computation of the optimal strategy needs to take into account only the current and next-step prices and not all possible future price trajectories.

We analyzed and solved the problem for the expected NPV criterion and two commonly used risk-based criteria: the Mean-Variance and Value at Risk models. We showed that in both risk-based models the optimization problem reduces to some form of quadratic optimization problem. To further improve the efficiency of the solution we exploited the structure of the covariance matrix, in particular the fact that gains for the same target locations are fully correlated. This allowed us to reduce a large optimization problem for both risk-based criteria into a sequence of problems of smaller complexity. The empirical results obtained for the mean-variance and value at risk models support the feasibility of the solution and its practical applicability.


We note that our results and algorithms can be applied directly to any multi-site model in which the next-step price fluctuations are normally distributed, and thus not necessarily mean-reverting. The current model can be extended in a number of ways. For example, interesting issues will arise if we refine the market models and extend them to include price spreads, trading (buy, sell) constraints, prices sensitive to supply and demand, etc. Another interesting direction is the investigation and application of more complex risk models, reflecting different preferences of an investor.